# Overview

Preprocessing of ASAP-seq and DOGMA-seq datasets.

## Documentation on preprocessing from Mimitou publication (methods) and online code
For reference.
### ASAP-seq
> "Similarly, we identified high-quality cells from the ASAP-seq dataset such that each cell had a TSS score exceeding 4 and a minimum of 1,000 fragments. Cells were further filtered out if they had excess abundance of total protein tags (>25,000 in either condition) or tags measured from the rat isotype controls (>75 in either condition)" From Methods in Mimitou et al., 2021

#### ATAC
- TSS enrichment > 4 (methods & [online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stimulation_asapseq_citeseq/code/02_ArchR_firstpass.R))
- fragments > 1000 (methods & [online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stimulation_asapseq_citeseq/code/02_ArchR_firstpass.R))
- doublet removal with ArchR ([online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stimulation_asapseq_citeseq/code/02_ArchR_firstpass.R))

#### ADT 
- counts >= 100 ([online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stimulation_asapseq_citeseq/code/01_ADT_preprocess_ASAP.R))
- counts <= 25000 (methods & [online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stimulation_asapseq_citeseq/code/01_ADT_preprocess_ASAP.R))
- counts for rat isotype control <= 75 (methods & [online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stimulation_asapseq_citeseq/code/01_ADT_preprocess_ASAP.R))


### DOGMA-seq
#### GEX
- pct_mito < 30 ([online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stim_multiome/code/11_setup.R))
- n_total_rna > 1000 ([online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stim_multiome/code/11_setup.R))
- n_feature_rna > 500 ([online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stim_multiome/code/11_setup.R))
- nCount_RNA < (10^4.5)

#### ADT
- totalCTRLadt < 10 ([online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stim_multiome/code/11_setup.R))
- totalADT > 100 ([online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stim_multiome/code/11_setup.R))
- `!(full_meta$CD8adt > 30 & full_meta$CD4adt > 100)`, i.e. high CD8 and high CD4 tags are mutually exclusive ([online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stim_multiome/code/11_setup.R))

#### ATAC
- pct_in_peaks > 50 ([online code](https://github.com/caleblareau/asap_reproducibility/blob/master/pbmc_stim_multiome/code/11_setup.R)). This seems to rely on the CellRanger preprocessing/output (peaks, etc.) for the ATAC data; ArchR was not used.



## Own preprocessing choices
The aim is to harmonize preprocessing across both datasets and, if anything, to be stricter than the original:
- no doublet filtering for ASAP-seq ATAC data
- data-adaptive upper thresholds for ATAC and GEX data
- unified strategy for finding informative ATAC features (Mimitou et al. used ArchR for ASAP-seq and CellRanger for DOGMA-seq); I use an orthogonal method, ScregSeg





For simplicity, I prefilter all ATAC data following what was done for ASAP-seq and then call regions with ScregSeg.

Afterward, I remove cells whose paired ADT or RNA data are of poor quality; for region calling, however, the ATAC data should be good enough.

Because I want to integrate these two datasets, I want to apply preprocessing that is as similar as possible.

I decided to be as stringent as possible and to apply all filters of both datasets (where possible).

I will not filter doublets with ArchR (as was done for ASAP-seq), but only make use of the "soft" doublet removal that makes high CD4 and high CD8 counts mutually exclusive.

### ASAP-seq and DOGMA-seq (applicable modalities)
#### ATAC 
- TSS score exceeding 4 
- minimum of 1,000 fragments
- additionally, after this initial filtering, remove cells with high counts above Q3 + 1.5x IQR (computed after the low-count cells have been removed!)

All cells passing these filters are used for finding informative regions with ScregSeg (detailed below). However, some of these cells might be dropped later for not meeting the thresholds of the other modalities!
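
The two-step logic above (lower thresholds first, then a data-adaptive upper bound) can be sketched as follows. This is a minimal, standard-library illustration and not the code used in the notebooks: the function names, the dictionary layout, and the choice of the "inclusive" quartile method are assumptions.

```python
from statistics import quantiles


def upper_fence(values):
    """Return Q3 + 1.5 * IQR for a sequence of at least two numbers."""
    # Assumption: "inclusive" quartiles (linear interpolation between data points).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def filter_atac_cells(cells, min_tss=4.0, min_fragments=1000):
    """Filter cells given as {barcode: (tss_score, n_fragments)}.

    Returns the surviving cells and the upper fence that was applied.
    """
    # Step 1: lower thresholds (TSS score > 4 and more than 1,000 fragments).
    passed = {
        bc: (tss, n)
        for bc, (tss, n) in cells.items()
        if tss > min_tss and n > min_fragments
    }
    # Step 2: the fence is computed only on cells that survived step 1,
    # so low-count cells cannot drag the quartiles down.
    fence = upper_fence([n for _, n in passed.values()])
    kept = {bc: v for bc, v in passed.items() if v[1] <= fence}
    return kept, fence
```

Computing the fence after the lower thresholds matters: if low-quality barcodes were included, Q1 and Q3 would shift toward zero and the upper bound would become too strict.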

#### ADT
 - total tags need to be <= 25,000 and > 100
 - ASAP: tags measured from the rat isotype controls <= 75
 - DOGMA: totalCTRLadt < 10. (I first thought there was no such thing in the DOGMA-seq data: the original code does a regex search for "Ctrl" on the protein names, and I found none. This was an error: the index was unnamed! There are isotype controls!)
 - `~((CD8_counts > 30) & (CD4_counts > 100))`
 - don't remove high counts, as these are most likely the cells of interest
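
As a compact restatement of the ADT rules listed above, here is a hedged sketch of a per-cell check. The function name, argument names, and the `assay` switch are hypothetical and only serve the illustration; the thresholds are the ones from the list.

```python
def passes_adt_filters(total_adt, total_ctrl, cd4, cd8, assay):
    """Return True if a cell passes the harmonized ADT filters.

    assay is "ASAP" or "DOGMA" (hypothetical labels); only the
    isotype-control cutoff differs between the two.
    """
    # Total tags: more than 100 and at most 25,000.
    if not (100 < total_adt <= 25_000):
        return False
    # Isotype controls: <= 75 for ASAP-seq, < 10 for DOGMA-seq.
    if assay == "ASAP":
        ctrl_ok = total_ctrl <= 75
    elif assay == "DOGMA":
        ctrl_ok = total_ctrl < 10
    else:
        raise ValueError(f"unknown assay: {assay!r}")
    # "Soft" doublet removal: reject cells that are high in both CD8 and CD4.
    both_high = cd8 > 30 and cd4 > 100
    return ctrl_ok and not both_high
```

Note that no upper bound other than the fixed 25,000 cutoff is applied, in line with the decision not to remove high ADT counts.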
 

#### RNA (only applicable to DOGMA-seq datasets)
 - pct_mito < 30
 - n_total_rna > 1000 ~< (10^4.5)~
 - instead, use a data-adaptive upper threshold: n_total_rna <= Q3 + 1.5x IQR, with the quartiles computed after excluding cells with low counts (<= 1000)!
 - n_feature_rna > 500
 - drop mitochondrial genes from analysis
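
The per-cell RNA metrics and the removal of mitochondrial genes can be sketched as below. This is an illustration only: the function names are hypothetical, and it assumes that mitochondrial genes are identified by the human gene-symbol prefix `MT-`, which is not stated in this README. The data-adaptive upper bound on n_total_rna is left out here.

```python
def rna_qc(counts, mito_prefix="MT-"):
    """Compute QC metrics for one cell given {gene_symbol: umi_count}."""
    total = sum(counts.values())
    # Assumption: mitochondrial genes share the prefix "MT-" (human symbols).
    mito = sum(c for gene, c in counts.items() if gene.startswith(mito_prefix))
    return {
        "n_total_rna": total,
        "n_feature_rna": sum(1 for c in counts.values() if c > 0),
        "pct_mito": 100.0 * mito / total if total else 0.0,
    }


def passes_rna_lower_filters(qc):
    """Apply pct_mito < 30, n_total_rna > 1000 and n_feature_rna > 500."""
    return (
        qc["pct_mito"] < 30
        and qc["n_total_rna"] > 1000
        and qc["n_feature_rna"] > 500
    )


def drop_mito_genes(counts, mito_prefix="MT-"):
    """Remove mitochondrial genes from the count dictionary."""
    return {g: c for g, c in counts.items() if not g.startswith(mito_prefix)}
```

In this sketch the QC metrics are computed while the mitochondrial genes are still present, and the genes are dropped only afterward; otherwise pct_mito would always be zero.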
 






## In practice
### Downloaded data to data/original
See README.md in data/original/Mimitou2021

### ATAC-specific (DOGMA-seq and ASAP-seq)
Informative features called with ScregSeg, for orientation/explanation see [McGarvey et al., 2022](https://doi.org/10.1016/j.xgen.2021.100083) and [tutorials](https://github.com/BIMSBbioinfo/scregseg/). In a nutshell, ScregSeg is a computational tool based on Hidden Markov Models that can be used as an alternative to peak calling for finding informative features for ATAC-seq data.

#### ATAC prior to notebooks
The provided fragment files for both DOGMA-seq and ASAP-seq data are gzipped BED files. However, ArchR, which is used for the initial filtering, requires 10x-like fragment files, i.e. sorted BED files that are bgzipped!

To this end, I need to unzip, sort, and bgzip all fragment files. The converted files are in /data/derived!

```01_TCU_fragments_to_10x_fragments.sh```
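
Plain gzip and bgzip files share the same extension, so it can be useful to check which one a file actually is. The sketch below is a hypothetical helper (not part of the script above) that inspects the first block header: according to the BGZF description in the SAM/BAM specification, a BGZF block is a gzip member with the FEXTRA flag set and an extra subfield with identifier `BC` and length 2.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF (bgzip) block header."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12:
            return False
        # gzip magic bytes, deflate compression, FEXTRA flag (bit 2) set.
        if header[:3] != b"\x1f\x8b\x08" or not header[3] & 4:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = fh.read(xlen)
    # Walk the extra subfields looking for SI1='B', SI2='C' with SLEN == 2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

This only checks the compression container; it does not verify that the fragments are sorted.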

#### Filtering of ATAC data with ArchR
```02_TCU_ATAC_filtering_ArchR.R```

Initial filtering based on number of fragments and TSS enrichment.


#### Scregseg first pass (shell script)
For details see: ```03_TCU_ScregSeg_first_pass.sh```

Settings for individual steps:
- scregseg make_tile
    - 1 kb (1,000 bp) bins
    - only autosomes 
- scregseg fragments_to_counts
- scregseg subset
    - subsets to only those cells meeting criterion of ```02_TCU_ATAC_filtering_ArchR.R``` (> 1000 fragments and TSS enrichment > 4)  
- scregseg filter
    - only keep regions with at least 1 count (after subsetting) across all cells and binarize data
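
To make the last step concrete, the following toy sketch shows what "keep regions with at least 1 count and binarize" means on a small regions x cells matrix. It only illustrates the idea on nested lists and is not how scregseg implements it; the function name is hypothetical.

```python
def filter_and_binarize(counts):
    """Filter and binarize a regions x cells count matrix (list of rows).

    Returns the indices of the kept regions and the binarized rows.
    """
    kept, binary = [], []
    for i, row in enumerate(counts):
        # Keep a region only if at least one count remains across all cells.
        if sum(row) >= 1:
            kept.append(i)
            # Binarize: any positive count becomes 1.
            binary.append([1 if c > 0 else 0 for c in row])
    return kept, binary
```

For example, `filter_and_binarize([[0, 0, 0], [2, 0, 1]])` drops the first region and turns the second into `[1, 0, 1]`.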

#### 04_TCU_ATAC_concatenate_scregseg_results.ipynb
```04_TCU_ATAC_concatenate_scregseg_results.ipynb```

#### Scregseg second pass (shell script)
```05_TCU_ScregSeg_second_pass_50_states.sh```
- scregseg fit_segment
    - 7 random runs
    - HMM with 50 states
    - 300 iterations starting from random initial parameters for each run
    
#### 06_TCU_Explore_scregseg_second_pass.ipynb
```06_TCU_Explore_scregseg_second_pass.ipynb```
Determine thresholds for exporting informative regions

#### Scregseg third pass (shell script)
```07_TCU_ScregSeg_third_pass_50_states.sh```
- scregseg seg_to_bed
    - use regions as informative regions/features if they belong to states that cover at most 1.5% of the genome and if they reach a posterior decoding probability of at least 0.9
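
The selection rule above can be illustrated with a small sketch. It is not scregseg code: the function name and the segment tuple layout are hypothetical, and it assumes that the 1.5% coverage limit is applied to each state individually.

```python
from collections import defaultdict


def informative_regions(segments, genome_size, max_coverage=0.015, min_prob=0.9):
    """Select segments given as (chrom, start, end, state, posterior_prob).

    A segment is kept if its state covers at most max_coverage of the
    genome and its posterior decoding probability is at least min_prob.
    """
    # Total number of base pairs assigned to each state.
    covered = defaultdict(int)
    for _, start, end, state, _ in segments:
        covered[state] += end - start
    # Assumption: the coverage limit applies per state, not to all states combined.
    rare_states = {s for s, bp in covered.items() if bp / genome_size <= max_coverage}
    return [
        seg for seg in segments
        if seg[3] in rare_states and seg[4] >= min_prob
    ]
```

The intuition is that states covering a large part of the genome mostly capture background, whereas rare states are more likely to mark informative regions.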

### GEX- and ADT-specific steps
Detailed in ```08_Preprocess_GEX_and_ADT_data_and_combine_with_ATAC.ipynb```

Combination of data.