# Input data preparation
This notebook prepares the data files needed for the cell-type-specific gene regulatory network (GRN) inference pipeline.

## Dataset description: 10x multiome
This is a 10x multiome example dataset provided by 10x [here](https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-10-k-1-standard-2-0-0). By using this notebook, you imply that you have already accepted the terms of use and privacy policy on the above hyperlinked webpage. The dataset summary webpage is also available [here](https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_web_summary.html).

## Preparation of individual input files
This section prepares each input file or folder in its own subsection. Each subsection describes the expected input file, demonstrates the preparation script (showing its usage when available), and briefly illustrates the content and/or format of the prepared file. All input files are placed in the `data` folder of this inference pipeline.

In [1]:
#Create input data folder
!mkdir -p ../data

### expression.tsv.gz
Read count matrix of RNA-profiled cells in compressed tsv format.

1. Download expression data in mtx.gz format

In [2]:
%%bash
set -eo pipefail
cd ../data
wget -q -o /dev/null https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.tar.gz
tar xf pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.tar.gz
rm pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.tar.gz


2. Convert from mtx.gz to tsv.gz format using helper script `expression_mtx.py`.

Usage:

In [3]:
!dictys_helper expression_mtx.py -h

usage: expression_mtx.py [-h] [--column COLUMN] input_folder output_file

Converts mtx.gz format expression file to tsv.gz format.

positional arguments:
  input_folder     Input folder that contains matrix.mtx.gz, features.tsv.gz,
                   and barcodes.tsv.gz.
  output_file      Output file in tsv.gz format

optional arguments:
  -h, --help       show this help message and exit
  --column COLUMN  Column ID in features.tsv.gz for gene name. Starts with 0.
                   Default: 1.


In [4]:
%%bash
set -eo pipefail
cd ../data
dictys_helper expression_mtx.py filtered_feature_bc_matrix expression.tsv.gz
rm -Rf filtered_feature_bc_matrix




See what it looks like:

In [5]:
%%bash
printf '%-10s%20s%20s%20s\n' '' $(cat ../data/expression.tsv.gz | gunzip | head -n 5 | awk -F "\t" '{print $1"\t"$2"\t"$3"\t"$4}')

            AAACAGCCAAGGAATC-1  AAACAGCCAATCCCTT-1  AAACAGCCAATGCGCT-1
A1BG                         0                   0                   0
A1BG-AS1                     0                   2                   1
A1CF                         0                   0                   0
A2M                          1                   0                   2
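
The same preview can be produced from Python, which may be convenient for later sanity checks. The sketch below uses only the standard library; the function name `preview_expression` is hypothetical, and it assumes the layout shown above: a header row whose first field is empty, followed by one row per gene with integer counts.

```python
import csv
import gzip


def preview_expression(path, n_genes=4, n_cells=3):
    """Return the first few cell names and gene rows of a tsv.gz count matrix."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        # Assumption: the first header field is the empty corner above the gene names.
        cells = header[1:1 + n_cells]
        rows = {}
        for row in reader:
            if len(rows) >= n_genes:
                break
            # Assumption: counts are stored as integers, as in the preview above.
            rows[row[0]] = [int(value) for value in row[1:1 + n_cells]]
    return cells, rows
```

For example, `preview_expression('../data/expression.tsv.gz')` would return the first three cell barcodes and a dictionary mapping the first four gene names to their counts in those cells.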


### bams
This folder contains one bam file for each cell with chromatin accessibility measurement. Each file should be named after its cell (`<cell name>.bam`).

1. Download all chromatin accessibility reads in bam format

Note:
* **This step can take hours or even over a day** depending on your internet connection

In [6]:
%%bash
wget -q -o /dev/null -O ../data/bams.bam https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_possorted_bam.bam

2. Split the bam file into individual bam files, one per cell, using helper script `split_bam.sh`.

Note:
* **This step can take hours or even over a day**
* The default setting will need ~30GB of memory for this dataset. Specify a lower `BUFFER_SIZE` below if you have less memory.

Usage:

In [7]:
!dictys_helper split_bam.sh

Usage: split_bam.sh [-h] whole.bam output_folder [arguments ...]
Splits input whole.bam file by cell barcode and per-barcode bam files to output folder
whole.bam       Input whole bam file containing reads with different barcodes
output_folder   Output folder with one text file per barcode
arguments       Arguments passed to split_bam_text.py
-h              Display this help


In [8]:
!dictys_helper split_bam_text.py -h

usage: samtools view whole.bam | python3 split_bam_text.py [-h] [--output_unknown OUTPUT_UNKNOWN] [--section SECTION] [--buffer_size BUFFER_SIZE] [--ref_expression REF_EXPRESSION] [--namemap NAMEMAP] output_folder

Splits input bam file (stdin from samtools view) by cell barcode and outputs
headerless individual text file per barcode to output folder.

positional arguments:
  output_folder         Output folder with one text file per barcode

optional arguments:
  -h, --help            show this help message and exit
  --output_unknown OUTPUT_UNKNOWN
                        Output text file for reads without barcodes or with
                        unknown barcodes (see --ref_expression)
  --section SECTION     Section header that contains cell barcode. Must be the
                        same list of cell barcodes/names as other places in
                        the pipeline, e.g. `subsets/*/names_atac.txt` and
                        `coord_atac.tsv.gz`. Default: "RG:

In [9]:
%%bash
set -eo pipefail
cd ../data
dictys_helper split_bam.sh bams.bam bams --section "CB:Z:" --ref_expression expression.tsv.gz
rm bams.bam





See what it looks like:

In [10]:
%%bash
ls -h1s ../data/bams | head

total 39G
 8.0M AAACAGCCAAGGAATC-1.bam
 3.3M AAACAGCCAATCCCTT-1.bam
 2.7M AAACAGCCAATGCGCT-1.bam
 224K AAACAGCCACACTAAT-1.bam
 1.3M AAACAGCCACCAACCG-1.bam
 1.7M AAACAGCCAGGATAAC-1.bam
 6.0M AAACAGCCAGTAGGTG-1.bam
 3.2M AAACAGCCAGTTTACG-1.bam
 3.2M AAACAGCCATCCAGGT-1.bam
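
Because file names must match cell names, it can be useful to list the cells in `expression.tsv.gz` that have no corresponding bam file. The following is a minimal sketch, not part of the pipeline: the function name `cells_without_bam` is hypothetical, and it assumes the expression header layout shown earlier (an empty first field followed by cell names). A missing file is not necessarily an error, for example if a cell has no chromatin accessibility reads.

```python
import csv
import gzip
import os


def cells_without_bam(expression_path, bam_dir):
    """List cell names in the expression header that lack a <cell name>.bam file."""
    with gzip.open(expression_path, "rt", newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    # Assumption: the first header field is empty and the rest are cell names.
    cells = header[1:]
    # Strip the ".bam" suffix to recover cell names from file names.
    present = {
        name[:-len(".bam")]
        for name in os.listdir(bam_dir)
        if name.endswith(".bam")
    }
    return [cell for cell in cells if cell not in present]
```

Calling `cells_without_bam('../data/expression.tsv.gz', '../data/bams')` would return the cell names, in header order, for which no bam file was found.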


### subsets & subsets.txt
* subsets.txt: Names of cell subsets. For each cell subset, a GRN is reconstructed.
* subsets: Folder containing one subfolder for each cell subset listed in `subsets.txt`. Each subfolder contains two files:
    - names_rna.txt: Names of cells that belong to this subset and have transcriptome measurement
    - names_atac.txt: Names of cells that belong to this subset and have chromatin accessibility measurement
    - Note: for joint measurements of RNA and ATAC, these two files should be identical in every folder.

Here the downloaded clustering result is used to define subsets. You can replace it with your custom clustering.

In [11]:
%%bash
#Location of clustering file
file_cluster='analysis/clustering/gex/graphclust/clusters.csv'

set -eo pipefail
cd ../data
wget -q -o /dev/null https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_analysis.tar.gz
#Extract cell names for each cluster
tar xf pbmc_granulocyte_sorted_10k_analysis.tar.gz 
subsets="$(tail -n +2 "$file_cluster" | awk -F , '{print $2}' | sort -u)"
echo "$subsets" | awk '{print "Subset"$1}' > subsets.txt
for x in $subsets; do
	mkdir -p "subsets/Subset$x"
	grep ",$x"'$' "$file_cluster" | awk -F , '{print $1}' > "subsets/Subset$x/names_rna.txt"
	cp "subsets/Subset$x/names_rna.txt" "subsets/Subset$x/names_atac.txt"
done
rm -Rf pbmc_granulocyte_sorted_10k_analysis.tar.gz analysis


See what it looks like:

In [12]:
#Cell subset list
!head ../data/subsets.txt

Subset1
Subset10
Subset11
Subset12
Subset13
Subset14
Subset2
Subset3
Subset4
Subset5


In [13]:
#RNA cell barcodes for Subset 1
!head -n 4 ../data/subsets/Subset1/names_rna.txt

AAACAGCCAATCCCTT-1
AAACAGCCAGTTTACG-1
AAACCAACAGGATGGC-1
AAACGGATCATGGCTG-1


In [14]:
#ATAC cell barcodes for Subset 1. They are identical because it's a joint profiling dataset.
!head -n 4 ../data/subsets/Subset1/names_atac.txt

AAACAGCCAATCCCTT-1
AAACAGCCAGTTTACG-1
AAACCAACAGGATGGC-1
AAACGGATCATGGCTG-1
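
If you replace the downloaded clustering with your own, the same folder layout can be written from Python. The sketch below is one possible way to do this for a joint profiling dataset; the function name `write_subsets` and the `{cell name: subset name}` dictionary input are assumptions made for illustration, not part of the pipeline.

```python
import os


def write_subsets(assignments, data_dir):
    """Write subsets.txt and subsets/<name>/names_{rna,atac}.txt.

    assignments: dict mapping each cell name to its subset name.
    Assumes joint RNA+ATAC profiling, so both name files are identical.
    """
    groups = {}
    for cell, subset in assignments.items():
        groups.setdefault(subset, []).append(cell)
    names = sorted(groups)
    # One subset name per line.
    with open(os.path.join(data_dir, "subsets.txt"), "w") as handle:
        handle.write("".join(name + "\n" for name in names))
    for name in names:
        folder = os.path.join(data_dir, "subsets", name)
        os.makedirs(folder, exist_ok=True)
        # One cell name per line, same content in both files.
        for filename in ("names_rna.txt", "names_atac.txt"):
            with open(os.path.join(folder, filename), "w") as handle:
                handle.write("".join(cell + "\n" for cell in groups[name]))
    return names
```

For separately profiled RNA and ATAC cells the two files differ, so this sketch would need separate inputs for each modality.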


### motifs.motif
All motifs in HOMER format. Motifs must be named `TF_...`, where TF is the TF gene name matching those in expression.tsv.gz. The same motif can appear more than once under different names to link it to multiple TFs. The [log odds detection threshold](http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html) of each motif must be valid. Motif files can be obtained from different motif databases, e.g. [HOCOMOCO](https://hocomoco11.autosome.org/downloads_v11), or from HOMER itself.

You can use **either** of the motif databases below or provide your own motifs.

Note:
* **Choose only one database** (HOMER or HOCOMOCO) below for your motifs
* Any database may use gene symbols that do not match yours. You are recommended to identify such discrepancies in the checking step and resolve them by manually editing the gene names in `motifs.motif`. This step is omitted for this tutorial.
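
To get a quick view of such mismatches outside the pipeline's own checking step, the TF names can be read from the motif headers and compared with your gene names. This is a minimal sketch with hypothetical function names (`motif_tfs`, `unmatched_tfs`); it assumes each header line starts with `>`, that the motif name is the second field, and that the TF name is the part of the motif name before the first underscore.

```python
def motif_tfs(motif_path):
    """Collect TF names (text before the first '_') from HOMER-format motif headers."""
    tfs = set()
    with open(motif_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue  # matrix rows carry no names
            fields = line[1:].rstrip("\n").split("\t")
            if len(fields) < 2:
                # Fall back to whitespace splitting if the header is not tab-separated.
                fields = line[1:].split()
            if len(fields) < 2:
                continue  # header without a motif name
            tfs.add(fields[1].split("_")[0])
    return tfs


def unmatched_tfs(motif_path, gene_names):
    """Return motif TF names that are absent from the given gene names."""
    return sorted(motif_tfs(motif_path) - set(gene_names))
```

Here `gene_names` would be the gene names in the first column of `expression.tsv.gz`. TFs reported by `unmatched_tfs` are candidates for manual renaming in `motifs.motif`.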

#### From HOMER
HOMER motifs are extracted directly from the HOMER installation using helper script `motif_homer.sh`. Usage:

In [15]:
!dictys_helper motif_homer.sh -h

Usage: motif_homer.sh [-b basedir] [ (-m mapfile) | (-o organism) ] [-c capitalization] [-h]
Extracts motif file from homer installation to stdout
-b basedir          Base directory of homer installation
                    Default: autodetect
-m mapfile          Mapfile mode: use motifs in $basedir/motifs/ by specifying a motif to gene mapping file.
                    The mapping file is a two-column headered text file mapping motif file (column 0) to gene name (column 1).
                    Default: $basedir/motifs/extras/motifs2symbol.txt
-o organism         Organism mode: use motifs in $basedir/data/knownTFs/organism/known.motifs by specifying an organism.
                    Each motifs is directly mapped to the gene in the front of its name (separated by :).
                    Those without gene names are kept but will be disgarded in the inference pipeline.
                    If option is unspecified, uses -m with its default setting.
-c capitalization   Capitaliz

To use HOMER motifs:

In [16]:
%%bash
dictys_helper motif_homer.sh > ../data/motifs.motif



See what it looks like:

In [17]:
!head -n 18 ../data/motifs.motif

>ATGACTCATC FOS_ap1_motif 6.049537 -1.782996e+03 0 9805.3,5781.0,3085.1,2715.0,0.00e+00
0.419	0.275	0.277	0.028
0.001	0.001	0.001	0.997
0.010	0.002	0.965	0.023
0.984	0.003	0.001	0.012
0.062	0.579	0.305	0.054
0.026	0.001	0.001	0.972
0.043	0.943	0.001	0.012
0.980	0.005	0.001	0.014
0.050	0.172	0.307	0.471
0.149	0.444	0.211	0.195
>SCCTSAGGSCAW TFAP2C_ap2gamma_motif 6.349794 -24627.169865 0 T:26194.0(44.86%),B:5413.7(9.54%),P:1e-10695
0.005	0.431	0.547	0.017
0.001	0.997	0.001	0.001
0.001	0.947	0.001	0.051
0.003	0.304	0.001	0.692
0.061	0.437	0.411	0.091
0.688	0.004	0.289	0.019


#### From HOCOMOCO
HOCOMOCO motifs are downloaded from [their website](https://hocomoco11.autosome.org/). Several versions and significance levels are available.

To use HOCOMOCO motifs (here v11, full collection, human, P<0.0001):

In [18]:
%%bash
wget -q -o /dev/null -O ../data/motifs.motif 'https://hocomoco11.autosome.org/final_bundle/hocomoco11/full/HUMAN/mono/HOCOMOCOv11_full_HUMAN_mono_homer_format_0.0001.motif'

See what it looks like:

In [19]:
!head -n 18 ../data/motifs.motif

>dKhGCGTGh	AHR_HUMAN.H11MO.0.B	3.3775000000000004
0.262728374765856	0.1227600511842322	0.362725638699551	0.25178593535036087
0.07633328991810645	0.08258130543118362	0.22593295481662123	0.6151524498340887
0.14450570038747923	0.28392173880411337	0.13815442099009081	0.4334181398183167
0.023935814057894068	0.016203821748029118	0.9253278681170539	0.03453249607702277
0.007919544273173793	0.953597675415874	0.017308392078009837	0.021174388232942286
0.02956192959210962	0.012890110758086997	0.9474192747166682	0.010128684933135217
0.007919544273173797	0.029561929592109615	0.012337825593096645	0.9501807005416201
0.007919544273173793	0.007919544273173793	0.9762413671804787	0.007919544273173793
0.27886589130660366	0.4285328543459993	0.10955683916661985	0.18304441518077724
>hnnGGWWnddWWGGdbWh	AIRE_HUMAN.H11MO.0.C	5.64711
0.38551919443239085	0.2604245534178759	0.1353299124033618	0.21872633974637148
0.18745267949274294	0.18745267949274294	0.14575446582123766	0.4793401751932764
0.1457544658

### genome
Folder containing reference genome in HOMER format. Creating a separate copy from the original location is recommended because HOMER creates preparsed files in this folder.

The reference genome is extracted from HOMER using helper script `genome_homer.sh`. Alternatively, you can place your custom genome in the same location.

Note:
* **The reference genome version must match the one used for the chromatin accessibility reads**

Usage:

In [20]:
!dictys_helper genome_homer.sh -h

Usage: genome_homer.sh [-b basedir] [-h] refgenome output_dir
Extracts reference genome from homer installation to output directory
refgenome           Name of reference genome in homer format, e.g. hg38.
                    You can get reference genomes available in homer with $basedir/configureHomer.pl -list
output_dir          Output directory to export reference genome as
-b basedir          Base directory of homer installation
                    Default: autodetect
-h                  Display this help


In [21]:
%%bash
dictys_helper genome_homer.sh hg38 ../data/genome

Downloading reference genome hg38 in homer


See what it looks like:

In [22]:
%%bash
ls -h1s ../data/genome | head

total 4.4G
4.0K annotations
 12K chrom.sizes
3.1G genome.fa
3.2M hg38.aug
 42M hg38.basic.annotation
673M hg38.full.annotation
164K hg38.miRNA
505M hg38.repeats
 24M hg38.rna


### gene.bed
Bed file of gene regions and strand information to locate transcription start sites. You can download a GTF/GFF file and convert it to this bed file. Note that gene names must be in the same format as in expression.tsv.gz.

1. Download GTF file from [Ensembl](http://useast.ensembl.org/info/data/ftp/index.html/)

Note:
* **The GTF file must use the same reference genome version as the chromatin accessibility reads**

In [23]:
%%bash
cd ../data
wget -q -o /dev/null -O gene.gtf.gz http://ftp.ensembl.org/pub/release-107/gtf/homo_sapiens/Homo_sapiens.GRCh38.107.gtf.gz
gunzip gene.gtf.gz

2. Extract gene regions from GTF file using helper script `gene_gtf.sh`

Usage:

In [24]:
!dictys_helper gene_gtf.sh -h

Usage: gene_gtf.sh [-f field] [-h] gtf_file bed_file
Extracts gene region from GTF file into bed file
gtf_file        Path of input GTF file
bed_file        Path of output BED file
-f field        Field name to extract. Default: gene_name
-h              Display this help


In [25]:
%%bash
dictys_helper gene_gtf.sh ../data/gene.gtf ../data/gene.bed
rm ../data/gene.gtf

See what it looks like:

In [26]:
!head ../data/gene.bed

chr1	11869	14409	DDX11L1	.	+
chr1	14404	29570	WASH7P	.	-
chr1	17369	17436	MIR6859-1	.	-
chr1	29554	31109	MIR1302-2HG	.	+
chr1	30366	30503	MIR1302-2	.	+
chr1	34554	36081	FAM138A	.	-
chr1	52473	53312	OR4G4P	.	+
chr1	57598	64116	OR4G11P	.	+
chr1	65419	71585	OR4F5	.	+
chr1	131025	134836	CICP27	.	+
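
To illustrate how gene regions and strand locate transcription start sites, the sketch below reads the six-column bed file and takes the start coordinate for genes on the `+` strand and the end coordinate for genes on the `-` strand. The function name `read_tss` is hypothetical, and this is a simplified illustration rather than the pipeline's own computation; in particular, it ignores 0-based versus 1-based coordinate conventions.

```python
def read_tss(bed_path):
    """Map each gene name to (chromosome, approximate TSS coordinate, strand)."""
    tss = {}
    with open(bed_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue  # skip blank or malformed lines
            chrom, start, end, name, _score, strand = fields[:6]
            # Transcription runs left to right on '+', right to left on '-'.
            position = int(start) if strand == "+" else int(end)
            # If a gene name occurs more than once, the last occurrence is kept.
            tss[name] = (chrom, position, strand)
    return tss
```

With the rows shown above, DDX11L1 (on `+`) would be assigned coordinate 11869 and WASH7P (on `-`) coordinate 29570.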


## Optional input files
### blacklist.bed
Bed file of regions to exclude in chromatin accessibility analyses