# Reproducing the cell type motif enrichment

This workbook explains the process of reproducing the cell type motif enrichment
analysis as seen in our publication. The steps covered are:

  * Predicting genome-wide with and without the global signal for the cell type of interest
  * Running motif analysis
  * Running cell type enrichment analysis on identified transcription factors

## 1. Predicting genome-wide with and without the global signal for the cell type of interest

First, predict genome-wide (chromosomes 1-22) for a cell type of interest and save the result as bigWig files, one per histone mark. Using the `EnformerCelltyping` conda environment, the command below creates a bigWig for each histone mark for the microglia from Nott et al., 2019:

```
python ./bin/predict_genome.py -c Nott19_Microglia -p data/demo/Nott19_Microglia_128.bigWig -o ./model_results/predictions/
```

Note: this should really only be run on a GPU, as it takes 33 hours even with one. The script can be adapted to run in parallel. Also make sure to process your ATAC-seq data as described in section 1.1 before running the command above on it.

Second note: if you want to predict in more than one cell type/tissue, I advise precomputing and saving the DNA embeddings by passing the DNA through Enformer. This massively speeds up genome-wide predictions for your cell type of interest, so it pays off if you (eventually) want to look at more than one cell type. It does require a substantial amount of disk space (~240 GB). To use this approach, first precompute the DNA embeddings with:

```
python ./bin/precompute_dna_embeddings.py 
```

Then use these embeddings when predicting in your cell type of interest with:

```
python ./bin/predict_genome_precomp.py -c Nott19_Microglia -p data/demo/Nott19_Microglia_128.bigWig -o ./model_results/predictions/
```

You can change the DNA embedding directory with the `-d` parameter.

Next, we need to predict the genome-wide signal without the global chromatin accessibility signal, to identify the peaks that rely on it. For the microglia from Nott et al., 2019, this can be done with:

```
python ./bin/predict_genome.py -c Nott19_Microglia_no_gbl -p data/demo/Nott19_Microglia_128.bigWig -o ./model_results/predictions/ -g 0
```

Note that I changed the name of the output and used the `-g` parameter to remove the global signal. We can also run this with the precomputed embeddings:

```
python ./bin/predict_genome_precomp.py -c Nott19_Microglia_no_gbl -p data/demo/Nott19_Microglia_128.bigWig -o ./model_results/predictions/ -g 0
```

## 2. Running motif analysis

Next, to process the peaks and run motif analysis using [Homer](http://homer.ucsd.edu/homer/motif/), run:

```
bash ./bin/motif_analysis.sh Nott19_Microglia
```

Note that this uses two conda environments (`EnformerCelltyping` and `bigwigtobed`); you can create both from the yml files in the `./environment` folder.

To identify the peaks to use for motif analysis, the script ranks peaks by how strongly they are affected by the removal of the global chromatin accessibility signal. The peaks in the top and bottom deciles are passed to Homer separately to identify transcription factor motif enrichment for peaks that are and are not affected by the cell's global chromatin accessibility signal.

This will create two files:

* `./model_results/motif_analysis/homer_background_tfs.csv` - The background list of all TFs tested for with Homer.
* `./model_results/motif_analysis/all_cells_sig_gbl_motifs.csv` - The TFs Homer found significantly enriched in the motifs from the histone mark peaks relating to the global cell type chromatin accessibility signal.

These two files will then be used to run the cell type enrichment.
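
The decile selection described in this section can be illustrated with a minimal Python sketch. It is not the code in `motif_analysis.sh`: the function name `split_by_global_effect`, the use of the absolute difference as the effect measure, and the toy per-peak signal values are all assumptions made for illustration.

```python
def split_by_global_effect(with_gbl, no_gbl, frac=0.1):
    """Return (most_affected, least_affected) peak indices.

    with_gbl / no_gbl: per-peak predicted signal with and without the
    global chromatin accessibility signal (hypothetical inputs).
    frac: fraction of peaks kept at each end (0.1 = one decile).
    """
    if len(with_gbl) != len(no_gbl):
        raise ValueError("both predictions must cover the same peaks")
    # Assumed effect measure: absolute change in predicted signal per peak
    effect = [abs(a - b) for a, b in zip(with_gbl, no_gbl)]
    # Peak indices sorted from least to most affected
    order = sorted(range(len(effect)), key=lambda i: effect[i])
    n = max(1, round(frac * len(effect)))
    least_affected = order[:n]
    most_affected = order[::-1][:n]  # largest change first
    return most_affected, least_affected


# Toy example with 10 peaks (made-up values, one decile = 1 peak)
with_gbl = [5.0, 3.0, 8.0, 1.0, 4.0, 6.0, 2.0, 7.0, 9.0, 10.0]
no_gbl = [4.9, 1.0, 7.5, 0.9, 0.5, 5.0, 1.8, 6.0, 9.0, 4.0]
most, least = split_by_global_effect(with_gbl, no_gbl)
print("most affected peaks:", most)
print("least affected peaks:", least)
```

In this sketch, the two index lists correspond to the two peak sets that would be handed to Homer separately.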

## 3. Running cell type enrichment analysis on identified transcription factors

Run [EWCE](https://www.frontiersin.org/articles/10.3389/fnins.2016.00016/full) to look for cell type enrichment of the transcription factors identified from the motif analysis. The R script used to run this and plot the results is here:

```
Rscript ./bin/cell_type_enrichment.R
```

This uses the [Descartes](doi:10.1126/science.aba7721) whole body dataset, so it should give a cell type enrichment applicable to all high-level cell types.

We ran this for 7 cell types for the manuscript and the results looked like this:

![cell type enrichment analysis](images/tf_cell_specificity_top10_bot1.png)
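
For intuition about what a bootstrap-based cell type enrichment test computes, the following is a simplified Python sketch of the general idea: compare the mean cell type specificity of the identified TFs against random gene sets of the same size drawn from the background list. It is not the EWCE package or `cell_type_enrichment.R`; the function name `bootstrap_enrichment`, the gene names, and the specificity values are hypothetical.

```python
import random


def bootstrap_enrichment(specificity, hits, background, n_reps=1000, seed=0):
    """Return (observed_mean, empirical_p) for one cell type.

    specificity: dict mapping gene -> specificity score in the cell type
    hits: TFs identified by the motif analysis
    background: all TFs that were tested (the sampling pool)
    """
    rng = random.Random(seed)
    # Keep only genes that have a specificity score
    hits = [g for g in hits if g in specificity]
    pool = [g for g in background if g in specificity]
    observed = sum(specificity[g] for g in hits) / len(hits)
    exceed = 0
    for _ in range(n_reps):
        # Random gene set of the same size, drawn without replacement
        sample = rng.sample(pool, len(hits))
        mean = sum(specificity[g] for g in sample) / len(hits)
        if mean >= observed:
            exceed += 1
    return observed, exceed / n_reps


# Toy data: 20 hypothetical TFs with increasing specificity
background = [f"TF{i}" for i in range(20)]
specificity = {g: i / 20 for i, g in enumerate(background)}
hits = ["TF17", "TF18", "TF19"]  # the three most specific TFs
observed, p_value = bootstrap_enrichment(specificity, hits, background)
print(f"observed mean specificity: {observed:.2f}, empirical p: {p_value:.3f}")
```

A small empirical p-value here means that random background sets rarely reach the specificity of the identified TFs in that cell type.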