# Context specific (static) gene regulatory network (GRN) inference
This notebook runs the context-specific GRN inference pipeline. It uses the input files prepared in the `data` folder and saves the context-specific GRNs into a single file, `output/static.h5`.

The pipeline is split into several parts so that computation speed can be optimized for each platform (CPU or GPU). Each part is organized into two sections in this notebook:
1. The execution section runs this part for network inference. You can see the commands and the output of each command.
2. The command description section explains each command involved in this part.

Depending on your mode of computation for pytorch (defined by `DEVICE` in `makefiles/config.mk`), each part may include more steps or may need to be skipped entirely. **Read the instructions of each part carefully.** If you cannot find the description of a command in one part, search the other parts.

**You should change** the `-j 32` option for `make` in every part to the number of parallel processes suitable for your machine. **It should be different** for the CPU and GPU parts. Here, the maximum number of cores used is `32*NTH=128`, where `NTH` is set in `makefiles/config.mk`.
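
As a rough guide for the CPU parts, the `-j` value can be chosen so that `jobs * NTH` does not exceed the number of available cores. The following is a minimal sketch, assuming `NTH=4` (consistent with `32*NTH=128` above); `suggest_jobs` is a hypothetical helper and not part of dictys.

```python
import os


def suggest_jobs(total_cores: int, nth: int) -> int:
    """Return the largest -j value with jobs * nth <= total_cores (at least 1)."""
    if total_cores < 1 or nth < 1:
        raise ValueError("total_cores and nth must be positive")
    return max(1, total_cores // nth)


# Setting used in this notebook: 128 cores with NTH=4 gives -j 32.
print(suggest_jobs(128, 4))  # prints 32

# Suggestion for the current machine; os.cpu_count() can return None.
print(suggest_jobs(os.cpu_count() or 1, 4))
```

This only addresses CPU cores. The GPU part is limited by GPU resources instead, which is why it uses a much smaller value (`-j 2`) in this notebook.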

## 1.1 CPU part 1, execution
Here, a GPU is used to speed up computation. Therefore, this part only infers the TF binding network that serves as a constraint for GRN inference.

If `make` reports errors such as `No rule to make target...` or `Target 'cpu' not remade because of errors`, they can be safely ignored because these targets will be produced in CPU part 2. Other errors, especially in the peak or footprint detection steps, can occur when a cell subset has a low cell count (typically several hundred or fewer). These errors can also be ignored; the affected cell subsets will be removed from the reconstructed networks.
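
When the output is long, it can help to separate the two expected `make` messages from any other `make` error lines. The sketch below is one way to do this, assuming the message formats shown in this notebook's output; `is_expected` and `needs_review` are hypothetical helper names, and the `Error 1` line in the example is an illustrative input rather than output from this run.

```python
import re

# "No rule to make target" messages: expected, resolved in CPU part 2.
NO_RULE = re.compile(r"^make: \*\*\* No rule to make target")
# Summary message printed by `make -k` when a goal could not be completed.
NOT_REMADE = re.compile(r"^make: Target [`'].+' not remade because of errors\.?$")


def is_expected(line: str) -> bool:
    """True for the two make messages described as safe to ignore."""
    line = line.strip()
    return bool(NO_RULE.match(line) or NOT_REMADE.match(line))


def needs_review(lines):
    """Return make error lines that are not among the expected messages."""
    return [
        line for line in lines
        if line.startswith("make: ***") and not is_expected(line)
    ]


log = [
    "make: *** No rule to make target `tmp_static/Subset1/net_weight.tsv.gz', "
    "needed by `tmp_static/Subset1/net_nweight.tsv.gz'.",
    "make: Target `cpu' not remade because of errors.",
    "make: *** [tmp_static/Subset5/peaks.bed] Error 1",  # hypothetical example
]
print(needs_review(log))
```

Lines returned by `needs_review` are not necessarily fatal: errors caused by low cell counts in a subset may still be acceptable, but they are worth inspecting by hand.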

In [1]:
# Remove the CPU usage limit imposed by some Jupyter versions
import os
os.environ['KMP_AFFINITY'] = ''


In [1]:
%%bash
set -eo pipefail
cd ..
#Run CPU part of inference
make -f makefiles/static.mk -j 32 -k cpu || true


mkdir -p tmp_static/Subset1/


make: *** No rule to make target `tmp_static/Subset1/net_weight.tsv.gz', needed by `tmp_static/Subset1/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset1/net_meanvar.tsv.gz', needed by `tmp_static/Subset1/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset1/net_covfactor.tsv.gz', needed by `tmp_static/Subset1/net_nweight.tsv.gz'.


cp data/subsets/Subset1/names_rna.txt tmp_static/Subset1/names_rna.txt
mkdir -p tmp_static/Subset10/


make: *** No rule to make target `tmp_static/Subset10/net_weight.tsv.gz', needed by `tmp_static/Subset10/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset10/net_meanvar.tsv.gz', needed by `tmp_static/Subset10/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset10/net_covfactor.tsv.gz', needed by `tmp_static/Subset10/net_nweight.tsv.gz'.


mkdir -p tmp_static/Subset11/


make: *** No rule to make target `tmp_static/Subset11/net_weight.tsv.gz', needed by `tmp_static/Subset11/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset11/net_meanvar.tsv.gz', needed by `tmp_static/Subset11/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset11/net_covfactor.tsv.gz', needed by `tmp_static/Subset11/net_nweight.tsv.gz'.


cp data/subsets/Subset10/names_rna.txt tmp_static/Subset10/names_rna.txt
cp data/subsets/Subset11/names_rna.txt tmp_static/Subset11/names_rna.txt
mkdir -p tmp_static/Subset12/
cp data/subsets/Subset12/names_rna.txt tmp_static/Subset12/names_rna.txt
mkdir -p tmp_static/Subset13/


make: *** No rule to make target `tmp_static/Subset12/net_weight.tsv.gz', needed by `tmp_static/Subset12/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset12/net_meanvar.tsv.gz', needed by `tmp_static/Subset12/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset12/net_covfactor.tsv.gz', needed by `tmp_static/Subset12/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset13/net_weight.tsv.gz', needed by `tmp_static/Subset13/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset13/net_meanvar.tsv.gz', needed by `tmp_static/Subset13/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset13/net_covfactor.tsv.gz', needed by `tmp_static/Subset13/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset14/net_weight.tsv.gz', needed by `tmp_static/Subset14/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset14/net_meanvar.tsv.gz', needed by `tmp_static/Subset14/net_

cp data/subsets/Subset13/names_rna.txt tmp_static/Subset13/names_rna.txt
mkdir -p tmp_static/Subset14/
cp data/subsets/Subset14/names_rna.txt tmp_static/Subset14/names_rna.txt
mkdir -p tmp_static/Subset2/
cp data/subsets/Subset2/names_rna.txt tmp_static/Subset2/names_rna.txt
mkdir -p tmp_static/Subset3/


make: *** No rule to make target `tmp_static/Subset2/net_weight.tsv.gz', needed by `tmp_static/Subset2/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset2/net_meanvar.tsv.gz', needed by `tmp_static/Subset2/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset2/net_covfactor.tsv.gz', needed by `tmp_static/Subset2/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset3/net_weight.tsv.gz', needed by `tmp_static/Subset3/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset3/net_meanvar.tsv.gz', needed by `tmp_static/Subset3/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset3/net_covfactor.tsv.gz', needed by `tmp_static/Subset3/net_nweight.tsv.gz'.


cp data/subsets/Subset3/names_rna.txt tmp_static/Subset3/names_rna.txt
mkdir -p tmp_static/Subset4/
cp data/subsets/Subset4/names_rna.txt tmp_static/Subset4/names_rna.txt
mkdir -p tmp_static/Subset5/
mkdir -p tmp_static/Subset6/


make: *** No rule to make target `tmp_static/Subset4/net_weight.tsv.gz', needed by `tmp_static/Subset4/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset4/net_meanvar.tsv.gz', needed by `tmp_static/Subset4/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset4/net_covfactor.tsv.gz', needed by `tmp_static/Subset4/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset5/net_weight.tsv.gz', needed by `tmp_static/Subset5/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset5/net_meanvar.tsv.gz', needed by `tmp_static/Subset5/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset5/net_covfactor.tsv.gz', needed by `tmp_static/Subset5/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset6/net_weight.tsv.gz', needed by `tmp_static/Subset6/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset6/net_meanvar.tsv.gz', needed by `tmp_static/Subset6/net_nweight.tsv.gz'.

cp data/subsets/Subset5/names_rna.txt tmp_static/Subset5/names_rna.txt
cp data/subsets/Subset6/names_rna.txt tmp_static/Subset6/names_rna.txt
mkdir -p tmp_static/Subset7/
cp data/subsets/Subset7/names_rna.txt tmp_static/Subset7/names_rna.txt
mkdir -p tmp_static/Subset8/


make: *** No rule to make target `tmp_static/Subset8/net_covfactor.tsv.gz', needed by `tmp_static/Subset8/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset9/net_weight.tsv.gz', needed by `tmp_static/Subset9/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset9/net_meanvar.tsv.gz', needed by `tmp_static/Subset9/net_nweight.tsv.gz'.
make: *** No rule to make target `tmp_static/Subset9/net_covfactor.tsv.gz', needed by `tmp_static/Subset9/net_nweight.tsv.gz'.


cp data/subsets/Subset8/names_rna.txt tmp_static/Subset8/names_rna.txt
mkdir -p tmp_static/Subset9/
cp data/subsets/Subset9/names_rna.txt tmp_static/Subset9/names_rna.txt
cp tmp_static/Subset1/names_rna.txt tmp_static/Subset1/names_atac0.txt
python3 -m dictys  preproc selects_rna  data/expression.tsv.gz tmp_static/Subset1/names_rna.txt tmp_static/Subset1/expression0.tsv.gz
cp tmp_static/Subset10/names_rna.txt tmp_static/Subset10/names_atac0.txt
python3 -m dictys  preproc selects_rna  data/expression.tsv.gz tmp_static/Subset10/names_rna.txt tmp_static/Subset10/expression0.tsv.gz
cp tmp_static/Subset11/names_rna.txt tmp_static/Subset11/names_atac0.txt
python3 -m dictys  preproc selects_rna  data/expression.tsv.gz tmp_static/Subset11/names_rna.txt tmp_static/Subset11/expression0.tsv.gz
cp tmp_static/Subset12/names_rna.txt tmp_static/Subset12/names_atac0.txt
python3 -m dictys  preproc selects_rna  data/expression.tsv.gz tmp_static/Subset12/names_rna.txt tmp_static/Subset12/expression0.tsv.

python3 -m dictys  chromatin macs2 --nth 4 tmp_static/Subset4/names_atac.txt data/bams tmp_static/Subset4/reads.bam tmp_static/Subset4/reads.bai tmp_static/Subset4/peaks.bed hs
python3 -m dictys  preproc selects_atac  tmp_static/Subset3/expression.tsv.gz tmp_static/Subset3/names_atac0.txt tmp_static/Subset3/names_atac.txt
python3 -m dictys  preproc selects_atac  tmp_static/Subset2/expression.tsv.gz tmp_static/Subset2/names_atac0.txt tmp_static/Subset2/names_atac.txt
python3 -m dictys  chromatin macs2 --nth 4 tmp_static/Subset3/names_atac.txt data/bams tmp_static/Subset3/reads.bam tmp_static/Subset3/reads.bai tmp_static/Subset3/peaks.bed hs
python3 -m dictys  chromatin macs2 --nth 4 tmp_static/Subset2/names_atac.txt data/bams tmp_static/Subset2/reads.bam tmp_static/Subset2/reads.bai tmp_static/Subset2/peaks.bed hs
python3 -m dictys  preproc selects_atac  tmp_static/Subset1/expression.tsv.gz tmp_static/Subset1/names_atac0.txt tmp_static/Subset1/names_atac.txt
python3 -m dictys  chromatin

INFO  @ Wed, 24 Aug 2022 13:20:11: #2 Use 150 as fragment length 
INFO  @ Wed, 24 Aug 2022 13:20:11: #2 Sequencing ends will be shifted towards 5' by 75 bp(s) 
INFO  @ Wed, 24 Aug 2022 13:20:11: #3 Call peaks... 
INFO  @ Wed, 24 Aug 2022 13:20:11: #3 Going to call summits inside each peak ... 
INFO  @ Wed, 24 Aug 2022 13:20:11: #3 Pre-compute pvalue-qvalue table... 
DEBUG @ Wed, 24 Aug 2022 13:20:11: Start to calculate pvalue stat... 
DEBUG @ Wed, 24 Aug 2022 13:20:44: access pq hash for 32888100 times 
INFO  @ Wed, 24 Aug 2022 13:20:44: #3 Call peaks for each chromosome... 
INFO  @ Wed, 24 Aug 2022 13:21:45: #4 Write output xls file... 04_peaks.xls 
INFO  @ Wed, 24 Aug 2022 13:21:45: #4 Write peak in narrowPeak format file... 04_peaks.narrowPeak 
INFO  @ Wed, 24 Aug 2022 13:21:46: #4 Write summits bed file... 04_summits.bed 
INFO  @ Wed, 24 Aug 2022 13:21:46: Done! 

python3 -m dictys  chromatin wellington --nth 4 tmp_static/Subset12/reads.bam tmp_static/Subset12/reads.bai tmp_static/

DEBUG @ Wed, 24 Aug 2022 13:22:26: Start to calculate pvalue stat... 
DEBUG @ Wed, 24 Aug 2022 13:23:49: access pq hash for 48797001 times 
INFO  @ Wed, 24 Aug 2022 13:23:49: #3 Call peaks for each chromosome... 
INFO  @ Wed, 24 Aug 2022 13:24:44: #4 Write output xls file... 04_peaks.xls 
INFO  @ Wed, 24 Aug 2022 13:24:45: #4 Write peak in narrowPeak format file... 04_peaks.narrowPeak 
INFO  @ Wed, 24 Aug 2022 13:24:45: #4 Write summits bed file... 04_summits.bed 
INFO  @ Wed, 24 Aug 2022 13:24:45: Done! 

python3 -m dictys  chromatin wellington --nth 4 tmp_static/Subset7/reads.bam tmp_static/Subset7/reads.bai tmp_static/Subset7/peaks.bed tmp_static/Subset7/footprints.bed
[bam_sort_core] merging from 28 files and 4 in-memory blocks...
INFO  @ Wed, 24 Aug 2022 13:20:43: 
# Command line: callpeak -t tmp_static/Subset4/reads.bam -n 04 -g hs --nomodel --shift -75 --extsize 150 --keep-dup all --verbose 4 --call-summits -q 0.05
# ARGUMENTS LIST:
# name = 04
# format = AUTO
# ChIP-seq file = 

INFO  @ Wed, 24 Aug 2022 13:23:56:  24000000 
INFO  @ Wed, 24 Aug 2022 13:23:59:  25000000 
INFO  @ Wed, 24 Aug 2022 13:24:02:  26000000 
INFO  @ Wed, 24 Aug 2022 13:24:05:  27000000 
INFO  @ Wed, 24 Aug 2022 13:24:08:  28000000 
INFO  @ Wed, 24 Aug 2022 13:24:12:  29000000 
INFO  @ Wed, 24 Aug 2022 13:24:15:  30000000 
INFO  @ Wed, 24 Aug 2022 13:24:18:  31000000 
INFO  @ Wed, 24 Aug 2022 13:24:21:  32000000 
INFO  @ Wed, 24 Aug 2022 13:24:24:  33000000 
INFO  @ Wed, 24 Aug 2022 13:24:27:  34000000 
INFO  @ Wed, 24 Aug 2022 13:24:31:  35000000 
INFO  @ Wed, 24 Aug 2022 13:24:34:  36000000 
INFO  @ Wed, 24 Aug 2022 13:24:37:  37000000 
INFO  @ Wed, 24 Aug 2022 13:24:40:  38000000 
INFO  @ Wed, 24 Aug 2022 13:24:43:  39000000 
INFO  @ Wed, 24 Aug 2022 13:24:46:  40000000 
INFO  @ Wed, 24 Aug 2022 13:24:49:  41000000 
INFO  @ Wed, 24 Aug 2022 13:24:52:  42000000 
INFO  @ Wed, 24 Aug 2022 13:24:56:  43000000 
INFO  @ Wed, 24 Aug 2022 13:24:59:  44000000 
INFO  @ Wed, 24 Aug 2022 13:25:02:


	Peak File Statistics:
		Total Peaks: 8210
		Redundant Peak IDs: 0
		Peaks lacking information: 0 (need at least 5 columns per peak)
		Peaks with misformatted coordinates: 0 (should be integer)
		Peaks with misformatted strand: 0 (should be either +/- or 0/1)

	Peak file looks good!

	Background fragment size set to 20 (avg size of targets)
	Background files for 20 bp fragments found.
	Custom genome sequence directory: data/genome

	Extracting sequences from file: data/genome/genome.fa
	Looking for peak sequences in a single file (data/genome/genome.fa)
	Extracting 874 sequences from chr1
	Extracting 277 sequences from chr10
	Extracting 420 sequences from chr11
	Extracting 377 sequences from chr12
	Extracting 115 sequences from chr13
	Extracting 258 sequences from chr14
	Extracting 188 sequences from chr15
	Extracting 416 sequences from chr16
	Extracting 584 sequences from chr17
	Extracting 94 sequences from chr18
	Extracting 959 sequences from chr19
	Extracting 463 sequences from chr

		BED/Header formatted lines: 25001
		peakfile formatted lines: 0

	Peak File Statistics:
		Total Peaks: 25001
		Redundant Peak IDs: 0
		Peaks lacking information: 0 (need at least 5 columns per peak)
		Peaks with misformatted coordinates: 0 (should be integer)
		Peaks with misformatted strand: 0 (should be either +/- or 0/1)

	Peak file looks good!

	Background fragment size set to 19 (avg size of targets)
	Background files for 19 bp fragments found.
	Custom genome sequence directory: data/genome

	Extracting sequences from file: data/genome/genome.fa
	Looking for peak sequences in a single file (data/genome/genome.fa)
	Extracting 2622 sequences from chr1
	Extracting 1005 sequences from chr10
	Extracting 1201 sequences from chr11
	Extracting 1271 sequences from chr12
	Extracting 474 sequences from chr13
	Extracting 870 sequences from chr14
	Extracting 822 sequences from chr15
	Extracting 1148 sequences from chr16
	Extracting 1677 sequences from chr17
	Extracting 397 sequences from chr

	Extracting 2552 sequences from chr1
	Extracting 1003 sequences from chr10
	Extracting 1283 sequences from chr11
	Extracting 1236 sequences from chr12
	Extracting 420 sequences from chr13
	Extracting 818 sequences from chr14
	Extracting 694 sequences from chr15
	Extracting 1217 sequences from chr16
	Extracting 1714 sequences from chr17
	Extracting 369 sequences from chr18
	Extracting 2258 sequences from chr19
	Extracting 1640 sequences from chr2
	Extracting 663 sequences from chr20
	Extracting 498 sequences from chr21
	Extracting 584 sequences from chr22
	Extracting 1359 sequences from chr3
	Extracting 831 sequences from chr4
	Extracting 1018 sequences from chr5
	Extracting 1294 sequences from chr6
	Extracting 1209 sequences from chr7
	Extracting 832 sequences from chr8
	Extracting 1065 sequences from chr9
	Extracting 400 sequences from chrX


	Reading input files...
	24957 total sequences read
	769 motifs loaded
	Finding instances of 769 motif(s)
	|0%                                  

	Extracting 2612 sequences from chr1
	Extracting 941 sequences from chr10
	Extracting 1276 sequences from chr11
	Extracting 1247 sequences from chr12
	Extracting 442 sequences from chr13
	Extracting 846 sequences from chr14
	Extracting 742 sequences from chr15
	Extracting 1205 sequences from chr16
	Extracting 1711 sequences from chr17
	Extracting 371 sequences from chr18
	Extracting 2075 sequences from chr19
	Extracting 1685 sequences from chr2
	Extracting 703 sequences from chr20
	Extracting 326 sequences from chr21
	Extracting 645 sequences from chr22
	Extracting 1274 sequences from chr3
	Extracting 780 sequences from chr4
	Extracting 1078 sequences from chr5
	Extracting 1407 sequences from chr6
	Extracting 1164 sequences from chr7
	Extracting 809 sequences from chr8
	Extracting 1040 sequences from chr9
	Extracting 577 sequences from chrX
	Extracting 2 sequences from chrY


	Reading input files...
	24958 total sequences read
	769 motifs loaded
	Finding instances of 769 motif(s)
	|0% 

	Extracting 386 sequences from chrX
	Extracting 2 sequences from chrY


	Reading input files...
	24851 total sequences read
	769 motifs loaded
	Finding instances of 769 motif(s)
	|0%                                    50%                                  100%|
	Cleaning up tmp files...

Reading BED File...
Calculating footprints...
Waiting for the last 30 jobs to finish...

python3 -m dictys  chromatin homer --nth 4 tmp_static/Subset4/footprints.bed data/motifs.motif data/genome tmp_static/Subset4/expression.tsv.gz tmp_static/Subset4/motifs.bed tmp_static/Subset4/wellington.tsv.gz tmp_static/Subset4/homer.tsv.gz
Reading BED File...
Calculating footprints...
Waiting for the last 30 jobs to finish...

python3 -m dictys  chromatin homer --nth 4 tmp_static/Subset12/footprints.bed data/motifs.motif data/genome tmp_static/Subset12/expression.tsv.gz tmp_static/Subset12/motifs.bed tmp_static/Subset12/wellington.tsv.gz tmp_static/Subset12/homer.tsv.gz

python3 -m dictys  chromatin binding  tmp_

	Genome = data/genome
	Output Directory = 15-motifscan/aaaac
	Using actual sizes of regions (-size given)
	Fragment size set to given
	Will use repeat masked sequences
	Will find motif(s) in data/motifs.motif
	Using Custom Genome
	Peak/BED file conversion summary:
		BED/Header formatted lines: 25001
		peakfile formatted lines: 0

	Peak File Statistics:
		Total Peaks: 25001
		Redundant Peak IDs: 0
		Peaks lacking information: 0 (need at least 5 columns per peak)
		Peaks with misformatted coordinates: 0 (should be integer)
		Peaks with misformatted strand: 0 (should be either +/- or 0/1)

	Peak file looks good!

	Background fragment size set to 19 (avg size of targets)
	Background files for 19 bp fragments found.
	Custom genome sequence directory: data/genome

	Extracting sequences from file: data/genome/genome.fa
	Looking for peak sequences in a single file (data/genome/genome.fa)
	Extracting 2506 sequences from chr1
	Extracting 1091 sequences from chr10
	Extracting 1294 sequences from 

python3 -m dictys  chromatin linking  tmp_static/Subset10/binding.tsv.gz tmp_static/Subset10/tssdist.tsv.gz tmp_static/Subset10/linking.tsv.gz
python3 -m dictys  chromatin binlinking  tmp_static/Subset10/linking.tsv.gz tmp_static/Subset10/binlinking.tsv.gz 20
Reading BED File...
Calculating footprints...
Waiting for the last 30 jobs to finish...

python3 -m dictys  chromatin homer --nth 4 tmp_static/Subset8/footprints.bed data/motifs.motif data/genome tmp_static/Subset8/expression.tsv.gz tmp_static/Subset8/motifs.bed tmp_static/Subset8/wellington.tsv.gz tmp_static/Subset8/homer.tsv.gz

	Position file = 14-reform-split/aaaab
	Genome = data/genome
	Output Directory = 15-motifscan/aaaab
	Using actual sizes of regions (-size given)
	Fragment size set to given
	Will use repeat masked sequences
	Will find motif(s) in data/motifs.motif
	Using Custom Genome
	Peak/BED file conversion summary:
		BED/Header formatted lines: 25001
		peakfile formatted lines: 0

	Peak File Statistics:
		Total Peaks

	Using actual sizes of regions (-size given)
	Fragment size set to given
	Will use repeat masked sequences
	Will find motif(s) in data/motifs.motif
	Using Custom Genome
	Peak/BED file conversion summary:
		BED/Header formatted lines: 25001
		peakfile formatted lines: 0

	Peak File Statistics:
		Total Peaks: 25001
		Redundant Peak IDs: 0
		Peaks lacking information: 0 (need at least 5 columns per peak)
		Peaks with misformatted coordinates: 0 (should be integer)
		Peaks with misformatted strand: 0 (should be either +/- or 0/1)

	Peak file looks good!

	Background fragment size set to 18 (avg size of targets)
	Background files for 18 bp fragments found.
	Custom genome sequence directory: data/genome

	Extracting sequences from file: data/genome/genome.fa
	Looking for peak sequences in a single file (data/genome/genome.fa)
	Extracting 2543 sequences from chr1
	Extracting 1050 sequences from chr10
	Extracting 1348 sequences from chr11
	Extracting 1231 sequences from chr12
	Extracting 497 s

	Output Directory = 15-motifscan/aaaad
	Using actual sizes of regions (-size given)
	Fragment size set to given
	Will use repeat masked sequences
	Will find motif(s) in data/motifs.motif
	Using Custom Genome
	Peak/BED file conversion summary:
		BED/Header formatted lines: 24997
		peakfile formatted lines: 0

	Peak File Statistics:
		Total Peaks: 24997
		Redundant Peak IDs: 0
		Peaks lacking information: 0 (need at least 5 columns per peak)
		Peaks with misformatted coordinates: 0 (should be integer)
		Peaks with misformatted strand: 0 (should be either +/- or 0/1)

	Peak file looks good!

	Background fragment size set to 19 (avg size of targets)
	Background files for 19 bp fragments found.
	Custom genome sequence directory: data/genome

	Extracting sequences from file: data/genome/genome.fa
	Looking for peak sequences in a single file (data/genome/genome.fa)
	Extracting 2573 sequences from chr1
	Extracting 1095 sequences from chr10
	Extracting 1369 sequences from chr11
	Extracting 1331

make: Target `cpu' not remade because of errors.


## 1.2 CPU part 1, command description
### preproc selects_rna
This command subsets the RNA data separately for cells of each context/cell type.

In [2]:
!python3 -m dictys preproc selects_rna -h

usage: dictys preproc selects_rna [-h] fi_reads fi_names fo_reads

Select samples/cells based on external table for RNA data.

positional arguments:
  fi_reads    Path of input tsv file of full expression matrix
  fi_names    Path of input text file of sample/cell names to select
  fo_reads    Path of output tsv file of expression matrix of selected
              samples/cells

optional arguments:
  -h, --help  show this help message and exit
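
Conceptually, this step keeps only the columns of the expression matrix that belong to the listed cells. The following is a minimal in-memory sketch of that idea, not the dictys implementation. It assumes a tab-separated matrix with genes as rows, cells as columns, and gene names in the first column; `select_columns` is a hypothetical name.

```python
import csv
import io


def select_columns(tsv_text: str, names: list[str]) -> str:
    """Keep the first (gene name) column plus the named cell columns."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header = rows[0]
    # header.index raises ValueError if a requested cell is missing.
    keep = [0] + [header.index(name) for name in names]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[i] for i in keep])
    return out.getvalue()


matrix = "gene\tc1\tc2\tc3\nA\t1\t2\t3\nB\t4\t5\t6\n"
print(select_columns(matrix, ["c3", "c1"]))
```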


### preproc qc_reads
This command removes low-read genes and cells for quality control.

In [3]:
!python3 -m dictys preproc qc_reads -h

usage: dictys preproc qc_reads [-h]
                               fi_reads fo_reads n_gene nc_gene ncp_gene
                               n_cell nt_cell ntp_cell

Quality control by bounding read counts. Quality control is perform separately
on genes based on their cell statisics and on cells based on their gene
statistics, iteratively until dataset remains unchanged. A gene or cell is
removed if any of the QC criteria is violated at any time in the iteration.
All QC parameters can be set to 0 to disable QC filtering for that criterion.

positional arguments:
  fi_reads    Path of input tsv file of read count matrix. Rows are genes and
              columns are cells.
  fo_reads    Path of output tsv file of read count matrix after QC
  n_gene      Lower bound on total read counts for gene QC
  nc_gene     Lower bound on number of expressed cells for gene QC
  ncp_gene    Lower bound on proportion of expressed cells for gene QC
  n_cell      Lower bound on total read
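
The iterative nature of this QC matters: removing a cell can push a gene below its threshold, and vice versa, so the filters are repeated until nothing changes. The sketch below illustrates this with only three of the bounds (`n_gene`, `nc_gene`, and a total-read bound for cells); the remaining criteria are omitted, and the data structure and function name are assumptions made for illustration rather than the dictys implementation.

```python
def qc_iterative(counts, n_gene=0, nc_gene=0, n_cell=0):
    """Iteratively drop low-read genes and cells until the data are stable.

    counts: dict mapping gene -> dict mapping cell -> read count.
    Returns (sorted kept genes, sorted kept cells).
    """
    genes = set(counts)
    cells = {cell for row in counts.values() for cell in row}
    while True:
        # Gene QC uses statistics over the cells that remain.
        bad_genes = {
            g for g in genes
            if sum(counts[g].get(c, 0) for c in cells) < n_gene
            or sum(1 for c in cells if counts[g].get(c, 0) > 0) < nc_gene
        }
        genes -= bad_genes
        # Cell QC uses statistics over the genes that remain.
        bad_cells = {
            c for c in cells
            if sum(counts[g].get(c, 0) for g in genes) < n_cell
        }
        cells -= bad_cells
        if not bad_genes and not bad_cells:
            return sorted(genes), sorted(cells)


toy = {
    "A": {"c1": 5, "c2": 5, "c3": 0},
    "B": {"c1": 0, "c2": 1, "c3": 1},
    "C": {"c1": 0, "c2": 0, "c3": 1},
}
print(qc_iterative(toy, n_gene=2, nc_gene=2, n_cell=2))
```

In this toy input, gene `B` passes the first round but fails the second one, after cell `c3` has been removed. This is the reason a single filtering pass is not enough.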

### preproc selects_atac
This command selects, for each cell subset, the chromatin accessibility cells that are also present in the expression matrix.

In [4]:
!python3 -m dictys preproc selects_atac -h

usage: dictys preproc selects_atac [-h] fi_exp fi_list fo_list

Select chromatin accessibility samples/cells based on presence in expression
matrix.

positional arguments:
  fi_exp      Path of input tsv file of expression matrix. Column must be
              sample/cell name.
  fi_list     Path of input text file of selected cell names, one per line
  fo_list     Path of output text file of selected cell names, one per line

optional arguments:
  -h, --help  show this help message and exit
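
In the run above, `names_atac0.txt` is a copy of `names_rna.txt`, and the selection is made against `expression.tsv.gz`. Assuming that file is the expression matrix after quality control, the effect is to drop cells that were removed by QC. A minimal sketch of the selection logic, with a hypothetical function name, could look like this:

```python
def select_present(candidate_names, expression_cells):
    """Keep candidate cell names that appear among the expression matrix columns.

    The order of candidate_names is preserved.
    """
    present = set(expression_cells)
    return [name for name in candidate_names if name in present]


# Cell "c2" is absent from the expression matrix and is therefore dropped.
print(select_present(["c1", "c2", "c3"], ["c3", "c1"]))  # prints ['c1', 'c3']
```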


### chromatin macs2
This command finds chromatin accessibility peaks with macs2.

In [5]:
!python3 -m dictys chromatin macs2 -h

usage: dictys chromatin macs2 [-h] [--qcut QCUT] [--nth NTH] [--nmax NMAX]
                              fi_names fi_bam fo_bam fo_bai fo_bed genome_size

Peak calling using macs2. Needs bam files for each cell in a given folder.

positional arguments:
  fi_names     Path of input text file containing one sample/cell name per
               line for macs2 peak calling
  fi_bam       Path of input folder that contains each cell's bam file by name
               in fi_names
  fo_bam       Path of output bam file for select samples/cells
  fo_bai       Path of output bai file for select samples/cells
  fo_bed       Path of output bed file of peaks
  genome_size  Genome size input of macs2. Use shortcuts hs or mm for human or
               mouse.

optional arguments:
  -h, --help   show this help message and exit
  --qcut QCUT  Qvalue cutoff for macs2 (default: 0.05)
  --nth NTH    Number of threads (default: 1)
  --nmax NMAX  Maximum number of peaks to retain, ordered
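
The execution log above shows the macs2 command line used for one subset (`callpeak -t ... -n 04 -g hs --nomodel --shift -75 --extsize 150 ...`). The sketch below only rebuilds that logged argument list to make the individual settings easier to read; `macs2_callpeak_args` is a hypothetical helper, and how dictys assembles the call internally is not shown in this notebook.

```python
def macs2_callpeak_args(bam, name, genome_size="hs", qcut=0.05):
    """Build the macs2 argument list in the form printed in the log above."""
    return [
        "callpeak",
        "-t", bam,                 # merged bam file of the cell subset
        "-n", name,                # prefix of macs2 output files
        "-g", genome_size,         # hs or mm shortcut
        "--nomodel",               # skip fragment size model building
        "--shift", "-75",          # shift read ends towards 5' by 75 bp
        "--extsize", "150",        # then extend to 150 bp fragments
        "--keep-dup", "all",
        "--verbose", "4",
        "--call-summits",
        "-q", str(qcut),           # q-value cutoff (--qcut)
    ]


print(" ".join(macs2_callpeak_args("tmp_static/Subset4/reads.bam", "04")))
```

The `--shift -75` and `--extsize 150` settings correspond to the log lines "Use 150 as fragment length" and "Sequencing ends will be shifted towards 5' by 75 bp(s)".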

### chromatin wellington
This command finds transcription factor footprints with wellington/pyDNase.

In [6]:
!python3 -m dictys chromatin wellington -h

usage: dictys chromatin wellington [-h] [--fi_blacklist FI_BLACKLIST]
                                   [--cut CUT] [--nth NTH] [--nmax NMAX]
                                   fi_bam fi_bai fi_bed fo_bed

TF footprinting with wellington.

positional arguments:
  fi_bam                Path of input bam file of all reads
  fi_bai                Path of input bai file of all reads
  fi_bed                Path of input bed file of peaks
  fo_bed                Path of output bed file of footprints

optional arguments:
  -h, --help            show this help message and exit
  --fi_blacklist FI_BLACKLIST
                        Path of input bed file of blacklisted genome regions
                        to be removed (default: None)
  --cut CUT             Cutoff for wellington score (default: 10)
  --nth NTH             Number of threads (default: 1)
  --nmax NMAX           Maximum number of footprints to retain, ordered by
                        wellington score. Use

### chromatin homer
This command scans for motif occurrences within provided regions (here open chromatin peaks or footprints) with homer.

In [7]:
!python3 -m dictys chromatin homer -h

usage: dictys chromatin homer [-h] [--nth NTH]
                              fi_bed fi_motif dirio_genome fi_exp fo_bed
                              fo_wellington fo_homer

Motif scan with homer.

positional arguments:
  fi_bed         Path of input bed file of regions
  fi_motif       Path of input motif PWM file in homer format. Motifs must be
                 named in format 'gene...' where gene matches gene names in
                 fi_exp. Should not contain duplicates.
  dirio_genome   Path of input & output folder or file for reference genome
                 for homer. A separate hard copy is recommended because homer
                 may write into the folder to preparse genome.
  fi_exp         Path of input expression matrix file in tsv format. Used for
                 mapping motifs to genes.
  fo_bed         Path of output bed file of detected motifs
  fo_wellington  Path of output tsv file of wellington scores in shape
                 (region,motif)


### chromatin binding
This command computes an integrative score for TF binding based on scores from footprint/peak discovery and from homer.

In [8]:
!python3 -m dictys chromatin binding -h

usage: dictys chromatin binding [-h] [--cuth CUTH] [--cutw CUTW] [--cut CUT]
                                [--combine COMBINE] [--mode MODE]
                                fi_wellington fi_homer fo_bind

Finding TF binding events. Combines wellington and homer outputs to infer TF
binding events by merging motifs to TFs.

positional arguments:
  fi_wellington      Path of input tsv file of wellington output
  fi_homer           Path of input tsv file of homer output
  fo_bind            Path of output tsv file of binding events

optional arguments:
  -h, --help         show this help message and exit
  --cuth CUTH        Homer score cutoff (default: 0)
  --cutw CUTW        Wellington score cutoff (default: 0)
  --cut CUT          Final score (integrating homer & wellington) cutoff
                     (default: None)
  --combine COMBINE  Method to combine scores of motifs of the same TF.
                     Accepts: max, mean, sum. (default: max)
  --mode MODE   
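
The `--combine` option matters when one TF has several motifs. The sketch below illustrates the three accepted methods on made-up scores. It covers only the combining step, not how the homer and wellington scores are integrated, and the motif names and the `combine_motif_scores` helper are hypothetical.

```python
from statistics import mean

# The three methods accepted by --combine.
COMBINERS = {"max": max, "mean": mean, "sum": sum}


def combine_motif_scores(motif_scores, motif_to_tf, method="max"):
    """Merge per-motif scores into one score per TF."""
    grouped = {}
    for motif, score in motif_scores.items():
        grouped.setdefault(motif_to_tf[motif], []).append(score)
    return {tf: COMBINERS[method](scores) for tf, scores in grouped.items()}


scores = {"GATA1_a": 2.0, "GATA1_b": 4.0, "SPI1_a": 1.0}
motif_to_tf = {"GATA1_a": "GATA1", "GATA1_b": "GATA1", "SPI1_a": "SPI1"}
for method in ("max", "mean", "sum"):
    print(method, combine_motif_scores(scores, motif_to_tf, method))
```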

### chromatin tssdist
This command computes the distance between each (open chromatin) region and each gene's transcription start site to prioritize nearby pairs that are more likely to have regulatory effects.

In [9]:
!python3 -m dictys chromatin tssdist -h

usage: dictys chromatin tssdist [-h] [--cut CUT] [--nmin NMIN] [--nmax NMAX]
                                fi_exp fi_wellington fi_tss fo_dist

Annotating TF bond regions to target genes based on distance to TSS.

positional arguments:
  fi_exp         Path of input expression matrix file in tsv format to obtain
                 gene names
  fi_wellington  Path of input tsv file of wellington scores to obtain DNA
                 regions
  fi_tss         Path of input bed file for gene region and strand
  fo_dist        Path of output tsv file of distance from TF-bond regions to
                 TSS

optional arguments:
  -h, --help     show this help message and exit
  --cut CUT      Distance cutoff between DNA region and target gene TSS
                 (default: 500000)
  --nmin NMIN    Minimal total number of links to recover (default: 1)
  --nmax NMAX    Maximal total number of links to recover (default: 10000000)
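
To make the distance cutoff concrete, the sketch below computes region-to-TSS distances and keeps pairs within `--cut`. It rests on several assumptions that are not confirmed by this notebook: the TSS is taken as the gene start on the `+` strand and the gene end on the `-` strand, the distance is measured from the nearest region edge, and a region overlapping the TSS has distance 0. All function names are hypothetical.

```python
def tss_position(start, end, strand):
    """TSS of a gene region: start on '+' strand, end on '-' strand."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return start if strand == "+" else end


def distance_to_tss(region_start, region_end, tss):
    """Distance in bp from the nearest region edge to the TSS (0 if overlapping)."""
    if region_start <= tss <= region_end:
        return 0
    return min(abs(region_start - tss), abs(region_end - tss))


def nearby_pairs(regions, genes, cut=500000):
    """Return {(region, gene): distance} for same-chromosome pairs within cut."""
    pairs = {}
    for rname, (rchrom, rstart, rend) in regions.items():
        for gname, (gchrom, gstart, gend, strand) in genes.items():
            if rchrom != gchrom:
                continue
            d = distance_to_tss(rstart, rend, tss_position(gstart, gend, strand))
            if d <= cut:
                pairs[(rname, gname)] = d
    return pairs


regions = {
    "r1": ("chr1", 5200, 5300),
    "r2": ("chr1", 600000, 600100),
    "r3": ("chr2", 5200, 5300),
}
genes = {"G": ("chr1", 1000, 5000, "-")}
print(nearby_pairs(regions, genes))  # prints {('r1', 'G'): 200}
```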


### chromatin linking
This command links each TF to its potential target genes via the relation: TF --- motif --- region --- gene.


In [10]:
!python3 -m dictys chromatin linking -h

usage: dictys chromatin linking [-h] [--combine COMBINE] [--mode MODE]
                                fi_binding fi_dist fo_linking

Linking regulators and targets with scores.

positional arguments:
  fi_binding         Path of input tsv file of binding events
  fi_dist            Path of input tsv file of distance from TF-bond regions
                     to TSS
  fo_linking         Path of output matrix file of TF to potential target gene
                     link scores

optional arguments:
  -h, --help         show this help message and exit
  --combine COMBINE  Method to combine scores of motifs of the same TF.
                     Accepts: max, mean, sum. (default: max)
  --mode MODE        Mode to compute final score. Accepts binary flags: 4:
                     Subtract log(10)*(distance_to_tss)/1E6 (default: 4)
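
The default mode subtracts `log(10)*(distance_to_tss)/1E6` from the score. As a worked example, the sketch below applies that penalty, assuming `log` is the natural logarithm and the distance is an absolute value in base pairs; neither assumption is stated in the help text, and `distance_penalized_score` is a hypothetical name.

```python
import math


def distance_penalized_score(score, distance_to_tss):
    """Subtract log(10) * distance / 1e6 from a link score."""
    return score - math.log(10) * abs(distance_to_tss) / 1e6


# Penalty at the default tssdist cutoff of 500000 bp.
print(distance_penalized_score(0.0, 500000))
# A region at the TSS itself is not penalized.
print(distance_penalized_score(5.0, 0))
```

Under these assumptions, the penalty grows linearly with distance and reaches `log(10)` at 1 Mb, so links from regions near the TSS are favored over distant ones with the same binding score.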


### chromatin binlinking
This command selects the strongest TF-target gene pairs as a TF binding network that constrains the GRN to be inferred.

In [11]:
!python3 -m dictys chromatin binlinking -h

usage: dictys chromatin binlinking [-h] [--axis AXIS] [--selfreg SELFREG]
                                   [--inf INF]
                                   fi_linking fo_binlinking n

Converting regulator-target link score matrix to binary. Chooses the top
regulator-target links, separately for each target gene by default.

positional arguments:
  fi_linking         Path of input matrix file of TF to potential target gene
                     link scores
  fo_binlinking      Path of output matrix file of TF to potential target gene
                     links
  n                  Number of regulator-target links. n strongest links (with
                     highest scores) are selected along axis axis. If greater
                     than the maximum links available, all links will be
                     selected subject to inf parameter constraint. Value -1
                     selects all non-inf links.

optional arguments:
  -h, --help         show this help messa
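
The default selection described above (the `n` strongest links, chosen separately for each target gene) can be sketched as follows. The helper `top_n_per_target` and the nested-dict layout are hypothetical, and the handling of infinite scores is an assumption about the `--inf` option; `--axis` and `--selfreg` are not modeled here.

```python
import math

# Hypothetical helper, not part of Dictys: top-n selection per target gene.
def top_n_per_target(scores, n):
    """scores[tf][gene] -> link score; returns the same layout with True/False.

    For each target gene, the n highest-scoring TFs are marked True.
    Infinite scores are never selected here (an assumption about --inf);
    n == -1 keeps every finite link.
    """
    tfs = list(scores)
    genes = list(next(iter(scores.values())))
    binary = {tf: {g: False for g in genes} for tf in tfs}
    for g in genes:
        finite = [tf for tf in tfs if math.isfinite(scores[tf][g])]
        ranked = sorted(finite, key=lambda tf: scores[tf][g], reverse=True)
        keep = ranked if n == -1 else ranked[:n]
        for tf in keep:
            binary[tf][g] = True
    return binary

# Toy score matrix: 3 TFs x 2 target genes
scores = {
    "TF1": {"GeneA": 3.0, "GeneB": -math.inf},
    "TF2": {"GeneA": 1.0, "GeneB": 0.5},
    "TF3": {"GeneA": 2.0, "GeneB": 0.1},
}
mask = top_n_per_target(scores, 2)
print(mask)
```

The resulting binary matrix plays the role of the TF binding network that constrains which edges the later inference step may use.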

## 2.1 GPU part, execution
**This part should be skipped if you use CPU for pytorch.**

This part performs GRN inference with scRNA-seq read counts on GPU.

In [12]:
%%bash
set -eo pipefail
cd ..
#Run GPU part of inference
make -f makefiles/static.mk -j 2 -k gpu || true


python3 -m dictys  network reconstruct --device cuda:0 --nth 4 tmp_static/Subset1/expression.tsv.gz tmp_static/Subset1/binlinking.tsv.gz tmp_static/Subset1/net_weight.tsv.gz tmp_static/Subset1/net_meanvar.tsv.gz tmp_static/Subset1/net_covfactor.tsv.gz tmp_static/Subset1/net_loss.tsv.gz tmp_static/Subset1/net_stats.tsv.gz
python3 -m dictys  network reconstruct --device cuda:0 --nth 4 tmp_static/Subset10/expression.tsv.gz tmp_static/Subset10/binlinking.tsv.gz tmp_static/Subset10/net_weight.tsv.gz tmp_static/Subset10/net_meanvar.tsv.gz tmp_static/Subset10/net_covfactor.tsv.gz tmp_static/Subset10/net_loss.tsv.gz tmp_static/Subset10/net_stats.tsv.gz
python3 -m dictys  network reconstruct --device cuda:0 --nth 4 tmp_static/Subset11/expression.tsv.gz tmp_static/Subset11/binlinking.tsv.gz tmp_static/Subset11/net_weight.tsv.gz tmp_static/Subset11/net_meanvar.tsv.gz tmp_static/Subset11/net_covfactor.tsv.gz tmp_static/Subset11/net_loss.tsv.gz tmp_static/Subset11/net_stats.tsv.gz
python3 -m dictys

## 2.2 GPU part, command description
### network reconstruct
This command uses pyro and pytorch to infer the GRN with a stochastic process model under the TF binding network constraint.

In [13]:
!python3 -m dictys network reconstruct -h

usage: dictys network reconstruct [-h] [--lr LR] [--lrd LRD] [--nstep NSTEP]
                                  [--npc NPC] [--fi_cov FI_COV]
                                  [--model MODEL]
                                  [--nstep_report NSTEP_REPORT]
                                  [--rseed RSEED] [--device DEVICE]
                                  [--dtype DTYPE] [--loss LOSS] [--nth NTH]
                                  [--varmean VARMEAN] [--varstd VARSTD]
                                  [--fo_weightz FO_WEIGHTZ]
                                  [--scale_lyapunov SCALE_LYAPUNOV]
                                  fi_exp fi_mask fo_weight fo_meanvar
                                  fo_covfactor fo_loss fo_stats

Reconstruct network with any pyro model in net_pyro_models that is based on
covariance_model and has binary masks.

positional arguments:
  fi_exp                Path of input tsv file of expression matrix.
  fi_mask               Path of input tsv 

## 3.1 CPU part 2, execution
**This part should be skipped if you use CPU for pytorch.**

This part performs network postprocessing to address variance estimation bias and indirect effects.

In [14]:
%%bash
set -eo pipefail
cd ..
#Run the remaining CPU part of inference (network postprocessing)
make -f makefiles/static.mk -j 32 -k cpu || true


python3 -m dictys  network normalize --nth 4 tmp_static/Subset1/net_weight.tsv.gz tmp_static/Subset1/net_meanvar.tsv.gz tmp_static/Subset1/net_covfactor.tsv.gz tmp_static/Subset1/net_nweight.tsv.gz
python3 -m dictys  network indirect --nth 4 --fi_meanvar tmp_static/Subset1/net_meanvar.tsv.gz tmp_static/Subset1/net_weight.tsv.gz tmp_static/Subset1/net_covfactor.tsv.gz tmp_static/Subset1/net_iweight.tsv.gz
python3 -m dictys  network normalize --nth 4 tmp_static/Subset10/net_weight.tsv.gz tmp_static/Subset10/net_meanvar.tsv.gz tmp_static/Subset10/net_covfactor.tsv.gz tmp_static/Subset10/net_nweight.tsv.gz
python3 -m dictys  network indirect --nth 4 --fi_meanvar tmp_static/Subset10/net_meanvar.tsv.gz tmp_static/Subset10/net_weight.tsv.gz tmp_static/Subset10/net_covfactor.tsv.gz tmp_static/Subset10/net_iweight.tsv.gz
python3 -m dictys  network normalize --nth 4 tmp_static/Subset11/net_weight.tsv.gz tmp_static/Subset11/net_meanvar.tsv.gz tmp_static/Subset11/net_covfactor.tsv.gz tmp_static/Su

python3 -m dictys  network normalize --nth 4 tmp_static/Subset7/net_iweight.tsv.gz tmp_static/Subset7/net_meanvar.tsv.gz tmp_static/Subset7/net_covfactor.tsv.gz tmp_static/Subset7/net_inweight.tsv.gz


## 3.2 CPU part 2, command description
### network normalize
This command normalizes network edges based on the standard deviation of regulator and target genes. This can overcome biases in the estimation of variance of gene expression due to single-cell sparsity.

In [15]:
!python3 -m dictys network normalize -h

usage: dictys network normalize [-h] [--norm NORM] [--nth NTH]
                                fi_weight fi_meanvar fi_covfactor fo_nweight

Normalize edge strength. So they are more resistant to estimation bias of true
expression level variance.

positional arguments:
  fi_weight     Path of input tsv file of edge weight matrix
  fi_meanvar    Path of iput tsv file of mean and variance of each gene's
                relative log expression
  fi_covfactor  Path of iput tsv file of factors for the off-diagonal
                component of gene covariance matrix
  fo_nweight    Path of output tsv file of normalized edge weight matrix

optional arguments:
  -h, --help    show this help message and exit
  --norm NORM   Type of normalization as binary flag values. Accepts: 1:
                Multiplying edge weight with stochastic noise std of TF 2:
                Dividing edge weight with stochastic noise std of target
                (default: 3)
  --nth NTH     Numbe
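
The two `--norm` flags can be read as a per-edge rescaling. The sketch below assumes a square weight matrix whose rows (regulators) and columns (targets) index the same gene list, and takes the stochastic noise standard deviations as a ready-made toy input; the helper `normalize_weights` is hypothetical, and how Dictys derives these values from `fi_meanvar` and `fi_covfactor` is not shown.

```python
# Hypothetical helper, not part of Dictys: illustrates the --norm binary flags.
def normalize_weights(weight, noise_std, norm=3):
    """weight[i][j] is the edge weight from regulator i to target j.

    noise_std[k] is the stochastic noise std of gene k (toy input).
    """
    out = []
    for i, row in enumerate(weight):
        new_row = []
        for j, w in enumerate(row):
            if norm & 1:
                w = w * noise_std[i]  # flag 1: multiply by noise std of the TF
            if norm & 2:
                w = w / noise_std[j]  # flag 2: divide by noise std of the target
            new_row.append(w)
        out.append(new_row)
    return out

weight = [[0.0, 0.5], [0.2, 0.0]]
noise_std = [2.0, 4.0]
print(normalize_weights(weight, noise_std))
```

With the default `--norm 3` both flags are active, so each edge is scaled by the ratio of regulator to target noise standard deviation.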

### network indirect
This command computes the steady-state total effect (direct + indirect) network from the kinetic direct effect network. The total effect network is **not yet integrated into network analysis functions**.

In [16]:
!python3 -m dictys network indirect -h

usage: dictys network indirect [-h] [--norm NORM] [--fi_meanvar FI_MEANVAR]
                               [--eigmin EIGMIN] [--eigmax EIGMAX]
                               [--multiplier MULTIPLIER] [--nth NTH]
                               fi_weight fi_covfactor fo_iweight

Computes steady-state indirect effect of gene perturbation from OU process.
Performs extra regularization on network by bounding the eigenvalues of
feedback loops with parameters eigmin and eigmax. Values away from 1 indicates
stronger feedback loop effects. Set values closer to 1 to apply stronger
regularization.

positional arguments:
  fi_weight             Path of input tsv file of edge weight matrix
  fi_covfactor          Path of iput tsv file of factors for the off-diagonal
                        component of gene covariance matrix
  fo_iweight            Path of output tsv file of steady-state indirect
                        effect edge weight matrix

optional arguments:
  -h, --help 
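
To give an intuition for "direct + indirect" effects, the sketch below sums the contributions of all paths in a generic linear network as the series W + W^2 + W^3 + ..., which converges only when feedback loops are weak enough. This is a textbook linear-model illustration under that assumption, not the OU-process computation or the eigenvalue regularization that `dictys network indirect` performs; the function names are hypothetical.

```python
# Generic linear-model sketch, not Dictys' implementation.
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def total_effect(direct, n_terms=50):
    """Sum direct and indirect path effects: W + W^2 + ... + W^n_terms."""
    total = [row[:] for row in direct]
    power = [row[:] for row in direct]
    for _ in range(n_terms - 1):
        power = matmul(power, direct)
        total = [[t + p for t, p in zip(tr, pr)]
                 for tr, pr in zip(total, power)]
    return total

# Chain A -> B -> C with no direct A -> C edge (toy weights)
direct = [
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.4],
    [0.0, 0.0, 0.0],
]
total = total_effect(direct)
print(total[0][2])
```

In this toy chain the A to C entry is zero in the direct network but nonzero in the total effect network, because A acts on C through B.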

## 4.1 Aggregating network part, execution
This part aggregates all the inferred networks into a single h5 file as output. This single file can be shared and processed by network analysis/visualization functions in Dictys.

In [17]:
%%bash
set -eo pipefail
cd ..
#Combine inferred networks to single h5 file
make -f makefiles/static.mk combine
#Optional step: Cleanup intermediate files
make -f makefiles/static.mk clean


mkdir -p output/
python3 -m dictys  network tofile  data tmp_static data/subsets.txt output/static.h5
rm -f tmp_static/Subset1/names_rna.txt tmp_static/Subset1/names_atac0.txt tmp_static/Subset1/names_atac.txt tmp_static/Subset1/expression0.tsv.gz tmp_static/Subset1/expression.tsv.gz tmp_static/Subset1/names_atac.txt tmp_static/Subset1/reads.bam tmp_static/Subset1/reads.bai tmp_static/Subset1/peaks.bed tmp_static/Subset1/footprints.bed tmp_static/Subset1/motifs.bed tmp_static/Subset1/homer.tsv.gz tmp_static/Subset1/wellington.tsv.gz tmp_static/Subset1/binding.tsv.gz tmp_static/Subset1/tssdist.tsv.gz tmp_static/Subset1/linking.tsv.gz tmp_static/Subset1/binlinking.tsv.gz tmp_static/Subset1/net_nweight.tsv.gz tmp_static/Subset1/net_iweight.tsv.gz tmp_static/Subset1/net_inweight.tsv.gz tmp_static/Subset10/names_rna.txt tmp_static/Subset10/names_atac0.txt tmp_static/Subset10/names_atac.txt tmp_static/Subset10/expression0.tsv.gz tmp_static/Subset10/expression.tsv.gz tmp_static/Subset10/names

## 4.2 Aggregating network part, command description
### network tofile
This command aggregates all inferred networks into a single output file.

In [18]:
!python3 -m dictys network tofile -h

usage: dictys network tofile [-h] [--dynamic] [--nettype NETTYPE]
                             [--optional OPTIONAL] [--fi_c FI_C]
                             diri_data diri_work fi_subsets fo_networks

Saving networks to a single file.

positional arguments:
  diri_data            Path of input data folder to load from
  diri_work            Path of input working folder to load from
  fi_subsets           Path of input txt file for cell subset names
  fo_networks          Path of output h5 file for all networks

optional arguments:
  -h, --help           show this help message and exit
  --dynamic            Whether to load a dynamic network instead of a set of
                       static networks (default: False)
  --nettype NETTYPE    Type of network. Accepts: '': Unnormalized direct
                       network 'n': Normalized direct network 'i':
                       Unnormalized steady-state network 'in': Normalized
                       steady-state net
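
The four `--nettype` values line up with the intermediate files produced in CPU part 2. The sketch below records that correspondence; the file names are taken from the commands printed in section 3.1, while the assumption that `network tofile` reads exactly these files for each network type, and the helper `network_file` itself, are illustrative only.

```python
from pathlib import PurePosixPath

# File names come from the commands printed in section 3.1; which file
# `network tofile` reads for each --nettype is an assumption of this sketch.
NETTYPE_FILES = {
    "": "net_weight.tsv.gz",      # unnormalized direct network
    "n": "net_nweight.tsv.gz",    # normalized direct network
    "i": "net_iweight.tsv.gz",    # unnormalized steady-state network
    "in": "net_inweight.tsv.gz",  # normalized steady-state network
}

def network_file(workdir, subset, nettype=""):
    """Return the path of the per-subset network file for a network type."""
    return str(PurePosixPath(workdir) / subset / NETTYPE_FILES[nettype])

print(network_file("tmp_static", "Subset1", "n"))
```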