# Analysis of pcHi-C data from human CD4+ T cells and ILC3 cells

This notebook walks through the analysis of promoter capture Hi-C (pcHi-C) data generated by Mikhail Spivakov's group. Alongside the ILC3 data, we also analyze CD4+ alpha-beta T cells as a positive control sample.

## Imports

In [1]:
import os

import pandas as pd
import pybedtools
from scipy.stats import pearsonr, spearmanr


In [2]:
# importing sys
import sys
 
# adding python folder to the system path
sys.path.insert(0, '../python/')
 
from ChicagoData import ChicagoData


## Getting PIRs from the CHiCAGO output file

I wrote a Python class, `ChicagoData`, to work with the CHiCAGO output text file. It filters out specific types of interactions, such as promoter-to-promoter or trans-chromosomal interactions.

Example:

```python
input_file = "../data/CHICAGO/hg38/inputs/ILC_5kb_within_newbmap_CHiCAGO_ABC_peakm.txt"

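# Import with the default settings
ILC3_data = ChicagoData(input_file)

# Import again with explicit filters, e.g. dropping trans-chromosomal
# and promoter-to-promoter (p2p) interactions
ILC3_data = ChicagoData(input_file,
                        drop_off_target_bait=True,
                        drop_off_target_oe=False,
                        drop_trans_chrom=True,
                        score_col="merged_score",
                        score_val=5,
                        remove_p2p=True)

# Inspect the resulting PIR table
ILC3_data.pir_df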
```
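
To make the filtering concrete, here is a minimal pandas sketch of the kind of row filters those options suggest. It is not the `ChicagoData` implementation: `filter_interactions` is a hypothetical helper, the toy rows are invented, and it assumes that other ends which are not baits carry the name `.` in `oeName`.

```python
import pandas as pd

# Toy peak matrix; column names follow the CHiCAGO input, values are made up.
toy = pd.DataFrame({
    "baitChr": ["chr1", "chr1", "chr2", "chr3"],
    "baitName": ["GENE_A", "GENE_A", "GENE_B", "GENE_C"],
    "oeChr": ["chr1", "chr1", "chr5", "chr3"],
    "oeName": [".", "GENE_D", ".", "."],
    "merged_score": [7.2, 9.1, 6.0, 3.4],
})


def filter_interactions(df, score_col="merged_score", score_val=5,
                        drop_trans_chrom=True, remove_p2p=True):
    """Hypothetical helper: keep rows that pass every requested filter."""
    # Keep interactions at or above the score threshold.
    keep = df[score_col] >= score_val
    if drop_trans_chrom:
        # Bait and other end must sit on the same chromosome.
        keep &= df["baitChr"] == df["oeChr"]
    if remove_p2p:
        # Assumption: a non-bait other end is named "." in oeName.
        keep &= df["oeName"] == "."
    return df[keep].reset_index(drop=True)


pirs = filter_interactions(toy)
print(pirs)
```

Only the first toy row survives: the second is promoter-to-promoter, the third is trans-chromosomal, and the fourth falls below the score threshold.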

## Test Example

In [11]:
# Input Chicago File
input_file = "../data/CHICAGO/hg38/inputs/ILC_5kb_within_newbmap_CHiCAGO_ABC_peakm.txt"

In [12]:
ilc3_file_dict = {"/Users/caz3so/workspaces/tacazares/pchic/data/peaks/ATAC/ILC3_ATAC_peaks.bed": "ATAC",
                  "/Users/caz3so/workspaces/tacazares/pchic/data/peaks/CHIP/ILC3_H3K27ac_peaks.bed": "H3K27ac",
                  "/Users/caz3so/workspaces/tacazares/pchic/data/peaks/CHIP/ILC3_H3K4me3_peaks.bed": "H3K4me3",
                  "/Users/caz3so/workspaces/tacazares/pchic/data/peaks/RE/ILC3_RE.bed": "RE"}

In [13]:
ILC3_data = ChicagoData(input_file, 
                        drop_off_target_bait=True, 
                        drop_off_target_oe=False, 
                        drop_trans_chrom=True,
                        score_col="merged_score",
                        score_val=5,
                        remove_p2p=True,
                        features_to_count=ilc3_file_dict,
                        gene_expression="/Users/caz3so/workspaces/tacazares/pchic/data/RNA/ILC3_mean_expression.tsv",
                        output_dir="/Users/caz3so/workspaces/tacazares/pchic/data/outputs",
                        output_basename="ILC_5kb_within_newbmap_CHiCAGO_ABC_peakm")

TypeError: _map_ABC_labels_() missing 1 required positional argument: 'row'

In [34]:
ILC3_data.pir_count_v_mean

| Unnamed: 0 | RE_count | Gene_Name | Mean_Gene_Expression |
|---|---|---|---|
| 0 | 0 | A1CF | 5.166667 |
| 0 | 0 | A2MP1 | 4.833333 |
| 0 | 0 | AADAT | 3.666667 |
| 0 | 0 | AARD | 81.833333 |
| 0 | 0 | ABCA13 | 57.166667 |
| ... | ... | ... | ... |
| 97 | 148 | TTF2 | 1940.666667 |
| 98 | 152 | CD44 | 12834.500000 |
| 98 | 152 | NSMCE1 | 8499.833333 |
| 99 | 320 | ID2 | 24945.500000 |


## Process ILC3 and CD4 CHiCAGO Files with ABC interactions included

In [2]:
pwd

'/Users/caz3so/workspaces/tacazares/pchic/notebooks'

In [3]:
out_dir = "../data/outputs"

### Set up the inputs

Set the paths to the CHiCAGO input files.

In [4]:
ilc3_file = "../data/CHICAGO/hg38/inputs/ILC_5kb_within_newbmap_CHiCAGO_ABC_peakm.txt"
#ilc3_old_file = "../data/CHICAGO/hg38/inputs/ILC3_merged_bin5K_score5.txt"

#cd4_1M_old_file = "../data/CHICAGO/hg38/inputs/CD4_1M_50K_merged_reweighting_peakmatrix_score5.txt"
#cd4_file = "../data/CHICAGO/hg38/inputs/CD4_1M_50K_5kb_within_newbmap_CHiCAGO_peakm.txt"
cd4_ABC = "../data/CHICAGO/hg38/inputs/CD4_1M_50K_5kb_within_newbmap_CHiCAGO_ABC_peakm.txt"

### Optional: Map the number of features that overlap the promoter interacting regions of the CHiCAGO interactions

One optional input to the object is a dictionary of BED files that correspond to genomic features. A feature is mapped back to a CHiCAGO interaction when it overlaps that interaction's promoter interacting region (PIR). Each dictionary key is a BED file path and each value is the label of the feature it represents; the label is used to build the column names and output filenames.
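
As a rough illustration of what counting overlapping features per PIR means, the sketch below uses small in-memory tables in place of BED files. The `count_overlaps` helper, the `chrom`/`start`/`end` column names, and the coordinates are all hypothetical; the class itself may well rely on `pybedtools` and behave differently.

```python
import pandas as pd

# Toy PIRs using the other-end coordinate columns of the CHiCAGO input.
pirs = pd.DataFrame({
    "oeChr": ["chr1", "chr1", "chr2"],
    "oeStart": [1000, 6000, 1000],
    "oeEnd": [6000, 11000, 6000],
})

# Toy peaks standing in for the contents of one BED file.
atac_peaks = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr2"],
    "start": [1500, 5900, 20000, 7000],
    "end": [1800, 6100, 20500, 7500],
})


def count_overlaps(pirs, peaks):
    """Hypothetical helper: number of peaks overlapping each PIR."""
    counts = []
    for pir in pirs.itertuples(index=False):
        same_chrom = peaks["chrom"] == pir.oeChr
        # Half-open intervals overlap when each one starts before the other ends.
        overlaps = (peaks["start"] < pir.oeEnd) & (peaks["end"] > pir.oeStart)
        counts.append(int((same_chrom & overlaps).sum()))
    return counts


# Mirror the dictionary idea: each feature table is paired with a label,
# and the label becomes the name of the count column.
features = [(atac_peaks, "ATAC")]
for peaks, label in features:
    pirs[label] = count_overlaps(pirs, peaks)

print(pirs)
```

The loop is quadratic in the worst case, which is fine for a toy table; a dedicated interval tool would be the better choice for genome-scale peak sets.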

In [5]:
ilc3_file_dict = {"/Users/caz3so/workspaces/tacazares/pchic/data/peaks/ATAC/ILC3_ATAC_peaks.bed": "ATAC",
                  "/Users/caz3so/workspaces/tacazares/pchic/data/peaks/CHIP/ILC3_H3K27ac_peaks.bed": "H3K27ac",
                  "/Users/caz3so/workspaces/tacazares/pchic/data/peaks/CHIP/ILC3_H3K4me3_peaks.bed": "H3K4me3",
                  "/Users/caz3so/workspaces/tacazares/pchic/data/peaks/RE/ILC3_RE.bed": "RE"}

cd4_file_dict = {"/Users/caz3so/workspaces/tacazares/pchic/data/peaks/ATAC/CD4_ATAC_peaks.bed": "ATAC",
                 "/Users/caz3so/workspaces/tacazares/pchic/data/peaks/CHIP/S008H1H1.ERX547940.H3K27ac.bwa.GRCh38.20150527.bed": "H3K27ac",
                 "/Users/caz3so/workspaces/tacazares/pchic/data/peaks/CHIP/S008H1H1.ERX547958.H3K4me3.bwa.GRCh38.20150527.bed": "H3K4me3",
                 "/Users/caz3so/workspaces/tacazares/pchic/data/peaks/RE/CD4_RE.bed": "RE"}

### Optional: Analyze gene expression data if available

Another optional input is a gene expression matrix. This matrix can be derived from any type of experimental data, but the gene identifiers must use the same naming scheme as the CHiCAGO input (e.g. ENSEMBL, GENBANK, ...). See the files in [RNA data](../data/RNA/) for examples.

In [6]:
ilc_gene_expression = "/Users/caz3so/workspaces/tacazares/pchic/data/RNA/ILC3_mean_expression.tsv"
cd4_gene_expression = "/Users/caz3so/workspaces/tacazares/pchic/data/RNA/CD4_mean_expression.tsv"
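
Because mismatched gene names would silently drop genes from the analysis, it can help to check the overlap before building the object. The sketch below is one way to do that with pandas; the expression column names and the toy values are assumptions for illustration, not the layout of the actual `.tsv` files.

```python
import pandas as pd

# Toy bait names as they might appear in the baitName column.
interactions = pd.DataFrame({"baitName": ["ID2", "CD44", "TTF2", "ID2"]})

# Toy expression table; column names and values are hypothetical.
expression = pd.DataFrame({
    "Gene_Name": ["ID2", "CD44", "SOME_OTHER_ID"],
    "Mean_Gene_Expression": [10.0, 20.0, 30.0],
})

bait_genes = set(interactions["baitName"])
expr_genes = set(expression["Gene_Name"])

# Genes present in both tables, and bait genes with no expression entry.
matched = sorted(bait_genes & expr_genes)
missing = sorted(bait_genes - expr_genes)
print(f"matched: {matched}")
print(f"missing from expression table: {missing}")

# A left merge keeps every interaction; unmatched genes get NaN expression.
merged = interactions.merge(expression, left_on="baitName",
                            right_on="Gene_Name", how="left")
print(merged)
```

A large `missing` list is usually a sign that the two files use different naming schemes.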

In [8]:
cd4_abc = ChicagoData(cd4_ABC,
                      drop_off_target_bait=True,
                      drop_off_target_oe=False,
                      drop_trans_chrom=True,
                      score_col="merged_score",
                      score_val=5,
                      remove_p2p=True,
                      features_to_count=cd4_file_dict,
                      gene_expression=cd4_gene_expression,
                      output_dir=out_dir,
                      output_basename="CD4_1M_50K_5kb_within_newbmap_CHiCAGO_ABC_peakm")


Importing /Users/caz3so/workspaces/tacazares/pchic/data/peaks/ATAC/CD4_ATAC_peaks.bed : Column will be saved as ATAC
Importing /Users/caz3so/workspaces/tacazares/pchic/data/peaks/CHIP/S008H1H1.ERX547940.H3K27ac.bwa.GRCh38.20150527.bed : Column will be saved as H3K27ac
Importing /Users/caz3so/workspaces/tacazares/pchic/data/peaks/CHIP/S008H1H1.ERX547958.H3K4me3.bwa.GRCh38.20150527.bed : Column will be saved as H3K4me3
Importing /Users/caz3so/workspaces/tacazares/pchic/data/peaks/RE/CD4_RE.bed : Column will be saved as RE
CHiCAGO    108026
NA          20724
Name: ABC_label, dtype: int64


ValueError: Columns must be same length as key

In [7]:
# Import ILC3 data
ilc3 = ChicagoData(ilc3_file,
                   drop_off_target_bait=True,
                   drop_off_target_oe=False,
                   drop_trans_chrom=True,
                   score_col="merged_score",
                   score_val=5,
                   remove_p2p=True,
                   features_to_count=ilc3_file_dict,
                   gene_expression=ilc_gene_expression,
                   output_dir=out_dir,
                   output_basename="ILC_5kb_within_newbmap_CHiCAGO_ABC_peakm")

# Import cd4_abc file CD4_1M_50K_5kb_within_newbmap_CHiCAGO_ABC_peakm.txt
cd4_abc = ChicagoData(cd4_ABC,
                      drop_off_target_bait=True,
                      drop_off_target_oe=False,
                      drop_trans_chrom=True,
                      score_col="merged_score",
                      score_val=5,
                      remove_p2p=True,
                      features_to_count=cd4_file_dict,
                      gene_expression=cd4_gene_expression,
                      output_dir=out_dir,
                      output_basename="CD4_1M_50K_5kb_within_newbmap_CHiCAGO_ABC_peakm")


Importing /Users/caz3so/workspaces/tacazares/pchic/data/peaks/ATAC/ILC3_ATAC_peaks.bed : Column will be saved as ATAC
Importing /Users/caz3so/workspaces/tacazares/pchic/data/peaks/CHIP/ILC3_H3K27ac_peaks.bed : Column will be saved as H3K27ac
Importing /Users/caz3so/workspaces/tacazares/pchic/data/peaks/CHIP/ILC3_H3K4me3_peaks.bed : Column will be saved as H3K4me3
Importing /Users/caz3so/workspaces/tacazares/pchic/data/peaks/RE/ILC3_RE.bed : Column will be saved as RE
CHiCAGO    66732
NA         13854
Name: ABC_label, dtype: int64


ValueError: Columns must be same length as key

In [61]:
cd4_abc.df.columns

Index(['baitChr', 'baitStart', 'baitEnd', 'baitID', 'baitName', 'oeChr',
       'oeStart', 'oeEnd', 'oeID', 'oeName', 'dist', 'ABC.Score', 'score',
       'merged_score', 'TargetGene', 'remove_line', 'oe_interval_ID',
       'bait_interval_ID', 'interaction_ID', 'ABC_label', 'ATAC', 'H3K27ac',
       'H3K4me3', 'RE'],
      dtype='object')

In [62]:
cd4_abc.df[cd4_abc.df["ABC.Score"] >= .026].shape

(23755, 24)

In [63]:
cd4_abc.df[cd4_abc.df["ABC.Score"] >= .026].head()["merged_score"]

0     5.0
1     5.0
2     5.0
3     5.0
11    5.0
Name: merged_score, dtype: float64

In [73]:
cd4_abc.df[cd4_abc.df["ABC.Score"] <= .026].head()["merged_score"]

6     5.415656
7     5.115383
8     5.115383
9     5.044562
10    5.044562
Name: merged_score, dtype: float64

In [65]:
cd4_abc.df[cd4_abc.df["score"] >= 5].shape

(108026, 24)

In [66]:
cd4_abc.df[(cd4_abc.df["ABC.Score"] >= .026) & (cd4_abc.df["score"] >= 5)].shape

(3031, 24)

In [67]:
cd4_abc.df.shape

(128750, 24)
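
The threshold checks above can be summarized in a single table. The sketch below cross-tabulates the two cutoffs used in this notebook (`ABC.Score >= 0.026` and `score >= 5`) on an invented toy frame; it says nothing about how `ABC_label` is actually assigned, and it treats a missing ABC score as failing the cutoff.

```python
import pandas as pd

# Toy interactions; the values are invented, the column names match cd4_abc.df.
toy = pd.DataFrame({
    "ABC.Score": [0.030, 0.010, 0.050, None, 0.026],
    "score": [6.1, 5.4, 2.0, 7.3, 1.0],
})

labels = {True: "pass", False: "fail"}

# Comparisons against NaN evaluate to False, so a missing ABC score is a "fail".
abc_status = (toy["ABC.Score"] >= 0.026).map(labels).rename("ABC >= 0.026")
chicago_status = (toy["score"] >= 5).map(labels).rename("score >= 5")

# Rows: ABC cutoff status, columns: CHiCAGO cutoff status.
summary = pd.crosstab(abc_status, chicago_status)
print(summary)
```

On the real data the same call would take `cd4_abc.df` in place of `toy`, giving the counts for both cutoffs, either one alone, or neither in one view.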

In [82]:
cd4_abc.df["ABC_label"].value_counts()

CHiCAGO    76686
ABC        52064
Name: ABC_label, dtype: int64

### Archived Analysis

In [None]:
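# NOTE: cd4_file, cd4_1M_old_file, and ilc3_old_file are commented out in the
# input setup cell above; uncomment them there before running this cell.
# Import cd4 file CD4_1M_50K_5kb_within_newbmap_CHiCAGO_peakm.txt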
cd4 = ChicagoData(cd4_file, 
                        drop_off_target_bait=True, 
                        drop_off_target_oe=False, 
                        drop_trans_chrom=True,
                        score_col="merged_score",
                        score_val=5,
                        remove_p2p=True,
                        features_to_count=cd4_file_dict,
                        gene_expression=cd4_gene_expression,
                        output_dir=out_dir,
                        output_basename="CD4_1M_50K_5kb_within_newbmap_CHiCAGO_peakm")

# Import old CD4_1M data
cd4_1M_old = ChicagoData(cd4_1M_old_file, 
                        drop_off_target_bait=True, 
                        drop_off_target_oe=False, 
                        drop_trans_chrom=True,
                        score_col="M1",
                        score_val=5,
                        remove_p2p=True,
                        features_to_count=cd4_file_dict,
                        gene_expression=cd4_gene_expression,
                        output_dir=out_dir,
                        output_basename="CD4_1M_50K_merged_reweighting_peakmatrix_score5")

# Import old ILC3 data
ilc3_old = ChicagoData(ilc3_old_file, 
                        drop_off_target_bait=True, 
                        drop_off_target_oe=False, 
                        drop_trans_chrom=True,
                        score_col="hILC3_merged_bin5k",
                        score_val=5,
                        remove_p2p=True,
                        features_to_count=ilc3_file_dict,
                        gene_expression=ilc_gene_expression,
                        output_dir=out_dir,
                        output_basename="ILC3_merged_bin5K_score5")

