# Analysis of pcHi-C data from human CD4+ T cells and ILC3 cells

This notebook walks through the analysis of pcHi-C data generated by Mikhail Spivakov's group. In addition to the ILC3 data, we analyze CD4+ alpha-beta T cells as a positive control sample.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data Overview
We have two cell types to analyze:

* Human Tonsil ILC3
* Human Venous Blood CD4+ alpha/beta T-cells

These two cell types were analyzed with two different analysis methods:

* 5 kb bin level - interactions are called on bins of approximately 5 kb rather than on individual restriction fragments
* fragment level - interactions are called on individual restriction fragments produced by DpnII

### Update 06.06.2022
Valeriya emailed me [new data](https://www.transferxl.com/manage/089FGscZhgKG8?token=6ao3nPY77S1Ed5_eBlxd3nYkCu-fgYAUBqjUtj9s74yAogdy2o8_QW0W1gq6wcdI-A%2C%2C) to analyze using the following approach:
 
> You will see that I also attached a CD4 CHiCAGO peakmatrix without ABC, because I now rerun CHiCAGO with more biological replicates merged together. If you could re-run RELI for that file too, it would be really great. Thank you!!

> Removing trans-chromosomal interactions can be achieved by removing the NA distance values. There should be no interactions with dist = 0 in the Chicago peakmatrices. The Capture Hi-C-adapted ABC analysis can generate pairs with dist = 0 and I should have removed those already from the peakmatrices that I sent you, but it wouldn’t harm to remove again in your pipeline just in case.

>CHiCAGO score >=5 is perfect, when working with peakmatrices generated from CHiCAGO interactions only. However, for the peakmatrices containing both CHiCAGO contacts and ABC pairs, please use the merged_score column >=5 for filtering.

>For the CD4+ data I have now run CHiCAGO on 4 replicates (two reps for 1M and two reps for 50K). So, for filtering you should just use either the score column (in case of the CHiCAGO peakmatrix) or the merged_score column (in case of the CHiCAGO_ABC peakmatrix). So there is no need to look at the scores in previous peakmatrices.

>please ignore the remove_line column.

>don’t worry about the off-targets in the OEName column.

>The promoter-promoter interactions should be removed for the RELI analysis. Have you noticed any difference for when you run with and without these interactions?
___
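
The filtering rules in the email above can be sketched in pandas as follows. This is a minimal sketch, not the pipeline used in this notebook: `filter_peakmatrix` and its parameters are hypothetical names, the toy table only mimics the `dist`, `score`, and `oeName` columns, and treating any `oeName` other than `"."` as a promoter-to-promoter contact is an assumption that should be checked against the actual peakmatrix.

```python
import numpy as np
import pandas as pd


def filter_peakmatrix(df: pd.DataFrame,
                      score_col: str = "score",
                      min_score: float = 5.0,
                      drop_p2p: bool = True,
                      non_bait_labels: tuple = (".",)) -> pd.DataFrame:
    """Apply the quoted filtering rules to a peakmatrix-like table."""
    # Trans-chromosomal interactions carry an NA distance
    keep = df["dist"].notna()
    # Pairs with dist == 0 should already be gone; remove them again just in case
    keep &= df["dist"] != 0
    # Use "score" for CHiCAGO-only files and "merged_score" for CHiCAGO + ABC files
    keep &= df[score_col] >= min_score
    if drop_p2p:
        # ASSUMPTION: other ends that are not baited promoters are labelled "."
        keep &= df["oeName"].isin(non_bait_labels)
    return df[keep].copy()


# Toy data, not real interactions
toy = pd.DataFrame({
    "dist": [12000.0, np.nan, 0.0, 50000.0, 80000.0],
    "score": [7.2, 9.0, 6.1, 3.0, 5.0],
    "oeName": [".", ".", ".", ".", "GENE1"],
})
filtered = filter_peakmatrix(toy, score_col="score")
```

For a CHiCAGO_ABC peakmatrix, the same function would be called with `score_col="merged_score"`.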

## Data

### RNA-seq

* ILC3 RNA-seq data was gathered from [GSE130775](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE130775). 
* CD4+ gene expression was downloaded from [GSE87254](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87254).

#### Approach
1. I used Salmon to quasi-map reads to transcripts, using hg38 annotations from Ensembl. I fetched the [hg38 genome](http://uswest.ensembl.org/Homo_sapiens/Info/Index) from the Ensembl site and followed the [HBC Github Salmon](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/04_quasi_alignment_salmon.html) tutorial as a guide for my analysis.

2. I imported the transcript counts and collapsed them to gene counts using [tximport](https://bioconductor.statistik.tu-dortmund.de/packages/33/bioc/vignettes/tximport/inst/doc/tximport.html), as described in the vignette.

3. I exported the gene expression matrix for ILC3 here: `20200706_hILC3_CD4_GeneCounts.tsv`.
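
The transcript-to-gene collapse in step 2 was done with tximport in R. The idea can be illustrated in pandas as below. This is only a simplified sketch that sums estimated read counts per gene; it is not a replacement for tximport. The `Name` and `NumReads` columns follow Salmon's `quant.sf` layout, while the toy values and the `tx2gene` mapping are made up for illustration.

```python
import pandas as pd

# Toy stand-in for a Salmon quant.sf table
# (real files also contain Length, EffectiveLength, and TPM columns)
quant = pd.DataFrame({
    "Name": ["tx1", "tx2", "tx3"],
    "NumReads": [10.0, 5.0, 7.0],
})

# Hypothetical transcript-to-gene mapping
tx2gene = pd.DataFrame({
    "Name": ["tx1", "tx2", "tx3"],
    "gene_id": ["geneA", "geneA", "geneB"],
})

# Attach a gene ID to every transcript, then sum counts within each gene
gene_counts = (
    quant.merge(tx2gene, on="Name", how="inner")
         .groupby("gene_id")["NumReads"]
         .sum()
)
```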

### ChIP-seq

* I obtained ILC3 histone mark data from [GSE77299](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi).
* I obtained CD4+ histone mark data from the [Blueprint Project](http://dcc.blueprint-epigenome.eu/#/experiments).

Detailed methods can be found in the write-up.

### ATAC-seq

1) I obtained ILC3 ATAC-seq data from [GSE77299](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi) and CD4+ ATAC-seq data from [PRJNA380283](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5623106/).
2) The CD4+ ATAC-seq data was mapped to hg19, so I used UCSC `liftOver` to convert it to hg38. Detailed methods can be found in the write-up.
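
The hg19-to-hg38 conversion can be scripted roughly as below. This is a sketch under assumptions: the BED file names are hypothetical, the chain file is assumed to be UCSC's `hg19ToHg38.over.chain.gz`, and the helper `liftover` is introduced here only for illustration. The argument order follows the `liftOver oldFile map.chain newFile unMapped` usage of the UCSC command-line tool.

```python
import subprocess
from typing import List


def liftover(in_bed: str, chain: str, out_bed: str, unmapped: str,
             dry_run: bool = True) -> List[str]:
    """Build (and optionally run) a UCSC liftOver command."""
    # Usage: liftOver oldFile map.chain newFile unMapped
    cmd = ["liftOver", in_bed, chain, out_bed, unmapped]
    if not dry_run:
        # Requires the liftOver binary to be on PATH
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical file names; dry_run=True only builds the command
cmd = liftover("cd4_atac_hg19.bed",
               "hg19ToHg38.over.chain.gz",
               "cd4_atac_hg38.bed",
               "cd4_atac_unmapped.bed")
```

Intervals that cannot be converted are written to the `unMapped` file, so it is worth checking how many peaks were lost.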

### Regulatory Elements

The RE features used in this analysis are the union of ATAC-seq data with H3K27ac data.
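
One way to take such a union is to pool both peak sets and merge overlapping intervals. The sketch below assumes that interpretation; the write-up may use a different tool or rule. The helper `union_intervals`, the column names, and the toy coordinates are all hypothetical. Book-ended intervals (where one starts exactly where another ends) are merged as well.

```python
import pandas as pd


def union_intervals(*beds: pd.DataFrame) -> pd.DataFrame:
    """Pool BED-like tables and merge overlapping intervals per chromosome."""
    combined = pd.concat(beds, ignore_index=True).sort_values(["chrom", "start", "end"])
    merged = []
    for chrom, start, end in combined[["chrom", "start", "end"]].itertuples(index=False):
        # Extend the previous interval if this one overlaps or touches it
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return pd.DataFrame(merged, columns=["chrom", "start", "end"])


# Toy peaks, not real data
atac = pd.DataFrame({"chrom": ["chr1", "chr1"], "start": [100, 500], "end": [200, 600]})
h3k27ac = pd.DataFrame({"chrom": ["chr1", "chr2"], "start": [150, 10], "end": [300, 50]})

re_union = union_intervals(atac, h3k27ac)
```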

___

## Getting PIRs From CHiCAGO output file

I wrote a Python class to work with the CHiCAGO output text file. It filters out specific types of interactions, such as off-target or trans-chromosomal interactions, and can apply a score threshold. Promoter-to-promoter filtering is not implemented in the version of the class shown below.

Example:

```python
input_file = "/Users/caz3so/scratch/20220606_spivakov_pchic_reanalysis/TransferXL-089FGscZhgKG8/ILC_5kb_within_newbmap_CHiCAGO_ABC_peakm.txt"

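# Filtering options are passed to the constructor, which reads, formats, and filters the file
ILC3_data = ChicagoData(input_file, dropna=True, drop_off_target=True, drop_trans_chrom=True)

# Unique promoter-interacting regions (PIRs) from the filtered interactions
ILC3_data.pir_df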
```

In [3]:
class ChicagoData(object):
    """Import CHICAGO data
    """
    def __init__(self,
                 filename: str,
                 dropna: bool = True,
                 drop_off_target: bool = True,
                 drop_trans_chrom: bool = True,
                 score_col: str = None
                 ):
        """Initialize the object

        Args:
            filename (str): Input CHICAGO txt file
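            dropna (bool, optional): Drop the interactions with NA distance. Defaults to True.
            drop_off_target (bool, optional): Drop off target interactions. Defaults to True.
            drop_trans_chrom (bool, optional): Drop transchromosomal interactions. Defaults to True.
            score_col (str, optional): Name of the score column; rows scoring below 5 are dropped. Defaults to None (no score filtering).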
        """
        # Set filename to the input filename
        self.filename = filename
        # Set whether to drop na interactions
        self.dropna = dropna
        # Set whether to drop off target
        self.drop_off_target = drop_off_target
        # Set whether to drop transchromosomal interactions
        self.drop_trans_chrom = drop_trans_chrom
        # Score column name for filtering
        self.score_col = score_col
        
        # Read file into DF
        self._read_file_()
        
        # Format the DF
        self._format_file_()
        
        # Filter the formatted DF
        self._filter_file_()

        # Get the PIR df
        self._get_PIR_df_()
        
    def _read_file_(self):
        """Read in original file
        """
        # Read in original file and save
        self.input_df =  pd.read_csv(self.filename, sep="\t", header=0, low_memory=False)
    
    def _format_file_(self):
        """Format CHICAGO file
        """
        # Create a copy of the raw input to be manipulated
        df = self.input_df.copy()
        
        # Format the chromosome names
        df["baitChr"] = "chr" + df["baitChr"].apply(str)
        df["oeChr"] = "chr" + df["oeChr"].apply(str)
        
        # Calculate the width of the intervals
        df["OE_width"] = df["oeEnd"] - df["oeStart"]
        
        # Create an ID column that can be used to track the intervals
        df["ID"] = df["baitChr"] + ":" + \
                   df["baitStart"].apply(str) + "-" + \
                   df["baitEnd"].apply(str) + "_" + \
                   df["oeChr"] + ":" + \
                   df["oeStart"].apply(str) + "-" + \
                   df["oeEnd"].apply(str)

        # Set the variable to the formatted df
        self.df = df
    
    def _filter_file_(self):
        """Filter the formatted CHICAGO results
        """
        if self.dropna:
            self.df.dropna(subset=["dist"], inplace=True)

        if self.drop_off_target:
            # Assign the result back; the bare expression had no effect
            self.df = self.df[self.df["baitName"] != "off_target"]

        if self.drop_trans_chrom:
            self.df = self.df[self.df["baitChr"] == self.df["oeChr"]]
        
        if self.score_col:
            self.df = self.df[self.df[self.score_col] >= 5]

        
    def _get_PIR_df_(self):
        """Get a DF of all PIR interactions
        """
        self.pir_df = self.df[["oeChr", "oeStart", "oeEnd", "OE_width"]].drop_duplicates(subset=["oeChr", "oeStart", "oeEnd"], keep="first")

    def write_PIR_bed(self, output_filename): 
        """Write PIRs from the filtered CHICAGO results to a bed file
        """       
        self.pir_df.to_csv(output_filename, sep="\t", header=False, index=False)
