# Building a regulation prior network for netZoo tools
Marouen Ben Guebila<sup>1</sup>

<sup>1</sup> Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA.

## Introduction

Several Network Zoo ([netzoo](netzoo.github.io)) tools require a regulation prior network ($W_0$) that serves as an initial estimate and a starting point for the inference of the final network $W$. Regulation prior networks are based on Transcription Factor Binding Sites (TFBS) detected in the promoter regions of target genes.

In this tutorial, we will go through the following steps to reconstruct a regulation prior for netzoo tools. 

- First, we will extract the sequences of the promoter regions of human genes.

- Second, we will use a database of TF Position Weight Matrices (PWMs), which associates each TF with a sequence motif where the TF is likely to bind.

- Third, we will scan the sequences of the promoters for TFBS using TF PWMs and a scan tool called FIMO<sup>1</sup>.

- Finally, we will derive the regulation prior network as a discrete binary network and a continuous network by using several derivations. We are particularly interested in the continuous derivations because [as shown previously](Controlling_The_Variance_Of_PANDA_Networks.ipynb) binary priors induce a strong bias on the final network.

Some parts of this notebook are intended for demonstration purposes only, because running the whole pipeline takes about a week on a 36-core machine. The variable `precomputed` is set to 1 so that the time-consuming parts are not executed on the server. However, you can try this code on your own cluster by:

- Installing the dependencies listed below,

- Replacing the paths with those of your machine,

- Or you can directly use the final computed networks that we provide at the end.

## Loading the required libraries

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
from Bio import SeqIO # to run Biopython
import pandas as pd
import multiprocessing # to run FIMO in parallel
from functools import partial
from netZooPy.panda.panda import Panda # to build a GRN network

## Extracting the sequences of promoter regions

First, you need the sequences of human genes from the latest build (hg38). The sequences can be downloaded from the UCSC website https://genome.ucsc.edu/cgi-bin/hgGateway. When you download the gene sequences, you have the option to pick the start nucleotide in relation to the Transcription Start Site (TSS) and the end nucleotide relative to the Transcription End Site (TSE). Here, we chose gene sequences that start at TSS-1000 base pairs and end at TSE+1000 base pairs. In total, there are 38723 gene sequences.

Since we are interested in the promoter region of each gene, we need to trim the sequences. The region of interest is TSS+/-1kb, therefore we keep the first 2kb of each sequence using the following function.

In [None]:
def reduceSequence(sequence):
    seq=sequence[:2000] # the sequence starts at TSS-1000, so the first 2000bp cover TSS-1000 to TSS+1000
    return seq
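
As a quick check of the coordinate convention, the sketch below trims a toy sequence and maps a 0-based position in the trimmed window to an offset relative to the TSS. The toy sequence and the helper `tss_offset` are illustrative assumptions and are not part of the pipeline; the sketch also assumes that the TSS sits exactly 1000 bp into each downloaded record.

```python
def reduce_sequence(sequence, window=2000):
    """Keep the first `window` bases of a record that starts at TSS-1000."""
    return sequence[:window]


def tss_offset(position, upstream=1000):
    """Offset of a 0-based window position relative to the TSS.

    Negative values are upstream of the TSS, positive values downstream.
    """
    return position - upstream


# Toy record: 1000 bp upstream of the TSS followed by 1500 bp of gene body.
toy = "A" * 1000 + "C" * 1500
promoter = reduce_sequence(toy)

print(len(promoter))                                        # size of the trimmed window
print(tss_offset(0), tss_offset(1000), tss_offset(1999))    # window edges and TSS
```

The trimmed window keeps the full upstream part and only the first 1000 bp after the TSS.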

Also, we will convert gene names from Ensembl gene identifiers (ENSG) to gene symbols, using the following file.

In [None]:
# read conversion file
geneCorr = pd.read_csv('/opt/data/netZooPy/regPrior/hg38_Tss_coordinates.csv',sep='\t')

Then, we can cut all the sequences.

In [None]:
precomputed=1
input_file ='/opt/data/netZooPy/regPrior/hg38_sequence_Tss1000_Tse1000.fasta'
output_file='/opt/data/netZooPy/regPrior/hg38_sequence_Tss1000_Tss1000.fasta'
if precomputed==0:
    fasta_sequences = SeqIO.parse(open(input_file),'fasta')
    finalSeq=''
    namelist=[]
    for fasta in fasta_sequences:
        name, sequence = fasta.id, str(fasta.seq)
        new_sequence   = reduceSequence(sequence)
        name           = name[16:]
        boolInd        = np.in1d(geneCorr.iloc[:,1],name)
        name           = geneCorr.iloc[boolInd,12].values[0]
        interName      = np.intersect1d(name, namelist)
        if interName.size == 0:
            namelist.append(name)
        else:
            continue
        name           = '>' + name
        print(name)
        finalSeq       = finalSeq + name + '\n' + new_sequence + '\n'

Finally, we can save the trimmed sequences.

In [None]:
if precomputed==0:
    # save file
    with open(output_file, 'w') as file:
        file.write(finalSeq)

## Getting and cleaning position weight matrices (PWMs)
The second step is to collect the PWMs for each TF which characterize their DNA binding motifs. The PWMs will allow us afterwards to scan the sequences obtained earlier for TFBS using the FIMO<sup>1</sup> software. 

We collected PWMs from [the companion website](http://humantfs.ccbr.utoronto.ca/download.php) to Lambert et al.,<sup>2</sup> which correspond to the database [CIS-BP](http://cisbp.ccbr.utoronto.ca/) 1.94d. In total, there are PWMs for 1149 TFs.

First, we need to put the PWMs in the format required for the sequence scanning tool FIMO. FIMO requires a `.meme` format for PWMs, therefore we need to convert PWMs from matrices to meme using [matrix2meme](http://meme-suite.org/doc/matrix2meme.html) from the [meme suite](http://meme-suite.org/).

In [None]:
if precomputed==0:
    # convert CIS-BP files to plain matrices by dropping the first column and the header
    for file in os.listdir():
        df=pd.read_csv(file,sep='\t')
        df=df.iloc[:,1:]
        os.chdir('/opt/data/netZooPy/regPrior/convPWM')
        df.to_csv(file, header=False, index=False, sep='\t')
        os.chdir('/data/PWMs')

    # call MEME Suite matrix2meme; some files could not be converted because some nucleotide positions summed to zero
    os.chdir('/opt/data/netZooPy/regPrior/convPWM')
    finalTfList=[]
    for file in os.listdir():
        bashCommand = "meme/libexec/meme-5.0.5/matrix2meme <" + file + "> " + file + ".meme"
        res=os.system(bashCommand)
        if res != 0:
            print(file)
        else:
            finalTfList.append(file[:-4])
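
To make the target format concrete, here is a minimal sketch that writes one toy PWM as MEME-formatted text in pure Python. It is not a replacement for `matrix2meme`: the function name `pwm_to_meme`, the toy matrix, and the placeholder values `nsites= 20` and `E= 0` are assumptions for illustration. The check on row sums mirrors the failure case mentioned in the comment above, where positions that sum to zero cannot be converted.

```python
def pwm_to_meme(name, rows):
    """Format one PWM (rows of [A, C, G, T] probabilities) as minimal MEME text."""
    header = (
        "MEME version 4\n\n"
        "ALPHABET= ACGT\n\n"
        "strands: + -\n\n"
        "Background letter frequencies (from uniform background):\n"
        "A 0.25000 C 0.25000 G 0.25000 T 0.25000 \n\n"
    )
    lines = [
        f"MOTIF {name}",
        # nsites and E are placeholder values in this sketch
        f"letter-probability matrix: alength= 4 w= {len(rows)} nsites= 20 E= 0",
    ]
    for row in rows:
        if abs(sum(row) - 1.0) > 1e-6:
            raise ValueError("each position must sum to 1")
        lines.append(" ".join(f"{p:.6f}" for p in row))
    return header + "\n".join(lines) + "\n"


# Toy motif with three positions (hypothetical values).
toy_pwm = [
    [0.70, 0.10, 0.10, 0.10],
    [0.00, 0.00, 1.00, 0.00],
    [0.25, 0.25, 0.25, 0.25],
]
meme_text = pwm_to_meme("TOY_TF", toy_pwm)
print(meme_text)
```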

Next, since TFs may have more than one DNA binding motif, we will select the best motif for each TF as specified by Lambert et al.<sup>2</sup>, by looking for the boolean `true` in the column `Best motif(s)` in the metadata file `Human_TF_MotifList_v_1.01.csv`.

In [None]:
if precomputed==0:
    # read TF motif table to select best motif per TF
    tf     =pd.read_csv("/opt/data/netZooPy/regPrior/Human_TF_MotifList_v_1.01.csv",dtype=str)
    indTF  =np.in1d(tf.iloc[:,6],finalTfList)
    tff    =tf
    tf     =tf.iloc[indTF,:]
    initTF =tf.iloc[0,1]
    tflist, tfflist =[],[]
    tflist.append(tf.iloc[0,6])
    tfflist.append(initTF)
    tfFound=0
    for i in range(tf.shape[0]):
        newTF = tf.iloc[i, 1]
        if tfFound==1 and initTF==newTF:
            continue
        else:
            if initTF!=newTF:
                if initTF != newTF and tfFound==0:
                    tflist.append(tf.iloc[i, 6])
                    tfflist.append(tf.iloc[i, 1])
                tfFound=0
                initTF =newTF
            if str(tf.iloc[i,7]).strip().upper()=='TRUE': # the table is read with dtype=str, so compare strings
                tflist.append(tf.iloc[i,6])
                tfflist.append(tf.iloc[i, 1])
                tfFound=1
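
The selection idea can also be sketched on a toy table with pandas. This is a simplified variant rather than the exact logic of the loop above: it keeps the first motif flagged as best for each TF and drops TFs without a flagged motif. The column names `tf`, `motif_id` and `best`, and the flag value `TRUE`, are hypothetical stand-ins for the columns of the metadata file.

```python
import pandas as pd

# Toy metadata table (hypothetical column names and values).
motifs = pd.DataFrame({
    "tf":       ["TF_A", "TF_A", "TF_B", "TF_C", "TF_C"],
    "motif_id": ["M1",   "M2",   "M3",   "M4",   "M5"],
    "best":     ["",     "TRUE", "TRUE", "",     ""],
})

# The flag is stored as text, so compare strings rather than booleans.
is_best = motifs["best"].str.strip().str.upper() == "TRUE"

# Keep the first flagged motif per TF.
best = motifs[is_best].drop_duplicates(subset="tf", keep="first")
best_by_tf = dict(zip(best["tf"], best["motif_id"]))
print(best_by_tf)
```

In this toy table, `TF_C` has no flagged motif and is therefore absent from the result.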

Finally, we can put the converted PWMs in a single file to simplify the analysis.

In [None]:
if precomputed==0:
    # put pwms in the same file
    os.chdir('/opt/data/netZooPy/regPrior/convPWM')
    finalMeme = 'MEME version 4\n\nALPHABET= ACGT\n\nstrands: + -\n\nBackground letter frequencies (from uniform background):\nA 0.25000 C 0.25000 G 0.25000 T 0.25000 \n\n'
    k=0
    finalTFName=[]
    for i in range(len(tflist)):
        try:
            file = open(tflist[i] + '.txt.meme', 'r')
            k    = k+1
            finalTFName.append(tfflist[i])
            data = file.read()
            finalMeme = finalMeme + data[145:151] + tfflist[i] + data[152:]
            file.close()
        except:
            print("TF not found")

    # save file
    with open('allTFs.meme', 'w') as file:
        file.write(finalMeme)

    # save TFs
    with open("tfNames.txt", "w") as f:
        for s in finalTFName:
            f.write(str(s) +"\n")

## Building the regulation prior network

In this step, we will scan the promoter sequences for TF binding motifs using FIMO. Our final network will determine the interactions between 1149 TFs and 38723 genes. First, we need to extract gene names and the number of TFs.

In [None]:
tfNames=[]
with open("/opt/data/netZooPy/regPrior/tfNames.txt", "r") as f:
  for line in f:
    tfNames.append(str(line.strip()))

input_file = '/opt/data/netZooPy/regPrior/hg38_sequence_Tss1000_Tss1000.fasta'
geneNames  = []
fasta_sequences = SeqIO.parse(open(input_file),'fasta')
for fasta in fasta_sequences:
        name, sequence = fasta.id, str(fasta.seq)
        geneNames.append(name)

nTFs       = len(tfNames)

Then we will run `FIMO` on each gene sequence and for each TF. The output of `FIMO` is a p-value that indicates the likelihood of binding of a given TF in the target sequence, using the PWMs that were selected as best in the previous step. Since a TF can have several binding sites in the target sequence, the p-values can be summarized using [Fisher's method](https://en.wikipedia.org/wiki/Fisher%27s_method). In our case, however, we will pick the lowest p-value for each TF-gene pair.
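
To illustrate the two ways of summarizing several hits, the sketch below compares the minimum p-value with Fisher's combined p-value for a toy list of hits. The hit values are made up, and the function `fisher_combined_pvalue` is an illustrative helper that assumes independent, strictly positive p-values. It relies on the closed form of the chi-square survival function for an even number of degrees of freedom, so it needs only the standard library.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose survival function has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Toy p-values for three hits of one TF in one promoter (hypothetical).
hits = [0.0004, 0.02, 0.3]
print("minimum p-value:", min(hits))
print("Fisher combined p-value:", fisher_combined_pvalue(hits))
```

With a single hit, both summaries return the same value.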

Before writing the parallel loop, let's discuss the output network of this step. 

Networks can be binary, wherein a value of 1 indicates the presence of a motif in the promoter region and 0 the absence of a TFBS. The binary values are obtained by thresholding the p-value. We can pick a significance threshold on either the p-values or the multiple testing corrected p-values (q-values). Therefore, we get two binary networks: a p-value thresholded network and a q-value thresholded network.
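
As a small illustration of the thresholding, the sketch below binarizes a toy matrix of minimum p-values. The TF and gene names and the values are made up. It assumes that a stored value of 0 means that no hit was reported for that TF-gene pair, which matches a matrix initialized with zeros.

```python
import pandas as pd

# Toy matrix of minimum p-values; 0 means no hit was reported (assumption).
pvals = pd.DataFrame(
    [[1e-6, 0.0],
     [5e-4, 2e-5]],
    index=["TF_A", "TF_B"],
    columns=["GENE_1", "GENE_2"],
)

threshold = 1e-5
# An edge is kept only when a hit exists and is at or below the threshold.
binary = ((pvals > 0) & (pvals <= threshold)).astype(int)
print(binary)
```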

However, we saw [previously](Controlling_The_Variance_Of_PANDA_Networks.ipynb) that seeding network inference with a binary network induces a strong bias on the distribution of the edges. Therefore, we are interested in deriving a continuous $W_0$.

To assess the strength of regulation, models of continuous binding have been used on ChIP-seq binding profiles as functions that decrease monotonically with the distance to the TSS<sup>3</sup>. Although these models have been used with ChIP-seq binding peaks over large sequence regions (~1 Mb), in our case we will apply them to PWM hits over smaller sequences (2kb). The difference is that with ChIP-seq data, TF binding has been established, whereas with TF DNA binding motif scans, binding is hypothetical or proven in vitro at best<sup>4</sup>.

The following continuous models are structurally similar, with variations in the parameters.

### 1. Garcia-Alonso model
This model has been applied on ChIP-seq data<sup>5</sup> to compute the strength of regulation $s$ between a gene $g$ and a TF $t$ using the following equation

$
\begin{equation}
s(t,g)=\sum_{k} e^{-d/(md\times10+1)}
\end{equation}
$

where $k$ indexes the binding sites of a given TF in the target sequence, $d$ is the distance between the $k^{th}$ detected motif in the sequence and the TSS, and $md$ is the median of these distances over all motifs.

Since we are considering a region of TSS+/-1kb, we will consider the motifs occurring after the TSS as having a distance of 0 to the TSS for this model and all the other models. Therefore, we added 1 in the denominator of the exponential to account for cases where all the motifs are after the TSS.
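
The score for one TF-gene pair can be sketched as a small function. This is a minimal sketch under stated assumptions: the name `garcia_alonso_score` is illustrative, distances are given in bp and are positive upstream of the TSS, and negative (downstream) distances are clamped to 0 as described above.

```python
import numpy as np


def garcia_alonso_score(distances):
    """Sum of exp(-d / (md * 10 + 1)) over all motif hits of one TF in one promoter.

    `distances` are in bp, positive upstream of the TSS; hits after the TSS
    (negative values) are treated as having a distance of 0.
    """
    d = np.clip(np.asarray(distances, dtype=float), 0, None)
    if d.size == 0:
        return 0.0  # no hit, no edge
    md = np.median(d)
    return float(np.sum(np.exp(-d / (md * 10 + 1))))


print(garcia_alonso_score([100, 300]))   # two upstream hits
print(garcia_alonso_score([-50, -10]))   # both hits after the TSS
```

When every hit lies after the TSS, the median is 0 and the `+1` keeps the denominator from vanishing, so each hit contributes exactly 1.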

We can get an intuition of the model by plotting it for a median distance of 200bp, once with the scaling factor of 10 and once without it.

In [None]:
md=200
x=np.arange(0,1000)
plt.plot(np.concatenate([-np.flip(x),x[1:]]), np.concatenate([np.exp(-np.flip(x) / (10*md)), np.ones(999) ]), label="md200" )
plt.legend()
plt.xlabel('Distance to TSS (bp)')
plt.ylabel('Strength of regulation')

md=200
x=np.arange(0,1000)
plt.plot(np.concatenate([-np.flip(x),x[1:]]), np.concatenate([np.exp(-np.flip(x) / (md)), np.ones(999)]), label="md200 without factor" )
plt.legend()

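With a median of 200bp, the score of a motif located 1kb upstream of the TSS drops to about $e^{-5}\approx0.007$ without the scaling factor, but only to about $e^{-0.5}\approx0.61$ with it. Since we are working on short sequences, the scaling factor of 10 seems to give better results because distal motifs within the window keep a substantial weight.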

### 2. Ouyang model
The previous model is a modification of the Ouyang model<sup>6</sup> which has two main differences:

- A scaling parameter that was set in the exponential to 5000bp for all TFs except E2f1, for which it was set to 500bp. In the previous model, the parameter was replaced by the median of all binding sites, therefore we will keep this modification for our study.

- A multiplicative factor to account for ChIP-seq reads mapped on the binding site. However, in our case, we will replace the ChIP mapped reads factor by $-\log_{10}(p\text{-value})$ to integrate the significance of binding into the computed score.

Therefore, the final model equation is: 

$
\begin{equation}
s(t,g)=\sum_{k} -\log_{10}(p\text{-value})\times e^{-d/(md\times10+1)}
\end{equation}
$

To get a sense of this function, we can plot the score for one TF assuming a p-value of $10^{-3}$.

In [None]:
pval=0.001
md=200
x=np.arange(0,1000)
plt.plot(np.concatenate([-np.flip(x),x[1:]]), np.concatenate([-np.log10(pval)*np.exp(-np.flip(x) / (10*md)), 3*np.ones(999) ]) )
plt.xlabel('Distance to TSS (bp)')
plt.ylabel('Strength of regulation')
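
The same score can be computed for a set of hits rather than plotted. The function below is a sketch, with the illustrative name `weighted_distance_score`, that assumes distances in bp (positive upstream, clamped to 0 after the TSS) and strictly positive p-values.

```python
import numpy as np


def weighted_distance_score(distances, pvalues):
    """Sum of -log10(p) * exp(-d / (md * 10 + 1)) over the motif hits."""
    d = np.clip(np.asarray(distances, dtype=float), 0, None)
    p = np.asarray(pvalues, dtype=float)
    if d.size == 0:
        return 0.0
    md = np.median(d)
    return float(np.sum(-np.log10(p) * np.exp(-d / (md * 10 + 1))))


# One hit at the TSS with p = 1e-3 contributes -log10(1e-3) = 3.
print(weighted_distance_score([0], [1e-3]))
# A more significant hit farther upstream can still outweigh it.
print(weighted_distance_score([500], [1e-6]))
```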

### 3. RP model

The Regulatory Potential (RP) model<sup>7</sup> assumes a decay function as the distance from the TSS increases.

$
\begin{equation}
s(t,g)=\sum_{k} e^{-(0.5+4\times d)}
\end{equation}
$

with $d$ the distance to TSS divided by 1000. 

In [None]:
md=200
x=np.arange(0,1000)/1000
plt.plot(np.concatenate([-np.flip(x),x[1:]]), np.concatenate([np.exp(-(0.5+4*np.flip(x))), np.exp(-0.5)*np.ones(999) ]) )
plt.xlabel('Distance to TSS (kb)')
plt.ylabel('Strength of regulation')
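
For reference, the decaying curve plotted above can be written as a function that sums the contributions of several hits. The name `rp_score` is illustrative. The sketch assumes distances in bp, positive upstream of the TSS and clamped to 0 after the TSS, and it uses the decaying form $e^{-(0.5+4d)}$ with $d$ in kb, as in the plot.

```python
import numpy as np


def rp_score(distances_bp):
    """Sum of exp(-(0.5 + 4 * d)) over motif hits, with d in kb."""
    d = np.clip(np.asarray(distances_bp, dtype=float), 0, None) / 1000.0
    return float(np.sum(np.exp(-(0.5 + 4.0 * d))))


# Contribution of a single hit at increasing distance from the TSS.
for dist in (0, 250, 1000):
    print(dist, rp_score([dist]))
```

A single hit contributes at most $e^{-0.5}$, reached at the TSS, and the contribution shrinks as the hit moves upstream.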

The RP model was later modified<sup>8</sup> to include a parameter that is calibrated using the study data set.

There could be many more derivations of continuous regulation strength using gene expression or the binding affinity of TFs. However, we will compute the three previous metrics for our model.

## Computing the motif networks

Now, let's define a parallel loop to call FIMO on our sequences and compute two binary networks and three continuous networks.

In [None]:
def parallelFIMO(tfi,geneNames,tfNames,pqval):
	regMatQval = np.zeros((1,len(geneNames)))
	regMatQ    = pd.DataFrame(data=regMatQval, columns=geneNames, index=[tfNames[tfi]])
	print(tfi)
	if pqval==1:
		bashCommand = 'fimo --qv-thresh --thresh 0.05 --verbosity 1 --max-stored-scores 100000000 --oc motifResultsTmp' + str(tfi+1)  + ' --motif ' + tfNames[tfi] \
		+ ' /opt/data/netZooPy/regPrior/allTFs.meme /opt/data/netZooPy/regPrior/hg38_sequence_Tss1000_Tss1000.fasta'
	else:
		bashCommand = 'fimo --thresh 1e-3 --no-qvalue --verbosity 1 --max-stored-scores 100000000 --oc motifResultsTmp' + str(tfi+1)  + ' --motif ' + tfNames[tfi] \
		+ ' /opt/data/netZooPy/regPrior/allTFs.meme /opt/data/netZooPy/regPrior/hg38_sequence_Tss1000_Tss1000.fasta'
	res = os.system(bashCommand)
	if res != 0:
		print('could not read fimo')
	os.chdir('/opt/data/netZooPy/regPrior/motifResultsTmp' + str(tfi+1))
	try:
		tf = pd.read_csv('fimo.tsv',sep='\t', comment='#')
		tf.iloc[:,3] = 1000 - tf.iloc[:,3] # center index to get TSS=0
		for geneName in geneNames:
			jset = np.where(np.in1d(tf.iloc[:,2],geneName))
			if jset[0].size != 0:
				if pqval in (2, 3, 4):
					# for continuous networks assume that position>TSS is equal to TSS
					indMil=np.where(tf.start < 0)
					tf.iloc[indMil[0],3]=0
				if pqval==1:
					regMatQ[geneName].iloc[0] = tf.iloc[jset[0],8].min()
				elif pqval==0:
					regMatQ[geneName].iloc[0] = tf.iloc[jset[0],7].min()
				elif pqval==2:
					md=np.median(tf.iloc[jset[0],3])
					regMatQ[geneName].iloc[0] = np.sum(np.exp((-tf.iloc[jset[0],3])/((md*10)+1)))
				elif pqval==3:
					md=np.median(tf.iloc[jset[0],3])
					if any(np.isinf(np.log10(tf.iloc[jset[0],7]))): # a p-value of 0 gives log10 = -inf
						print('Infinite value')
					tmpRes = np.multiply(-np.log10(tf.iloc[jset[0],7]),  np.exp((-tf.iloc[jset[0],3])/((md*10)+1)))
					regMatQ[geneName].iloc[0] = np.sum(tmpRes)
				elif pqval==4:
					tmpRes=np.exp( -(0.5+4 * (tf.iloc[jset[0],3]/1000)) ) # decay with distance to TSS (in kb)
					regMatQ[geneName].iloc[0] = np.sum(tmpRes)
	except:
		print('empty fimo result')
	os.chdir('/opt/data/netZooPy/regPrior/data')
	os.system('rm -rf motifResultsTmp' + str(tfi+1))
	return regMatQ
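
The per-gene bookkeeping inside the function can be illustrated on a toy hit table. The sketch below assumes the column names `sequence_name`, `start` and `p-value` of the FIMO TSV output, and the rows are made up. It shows the conversion of start positions to TSS-relative distances, the clamping of downstream hits to 0, and a `groupby` as one possible alternative to looping over genes for the minimum p-value.

```python
import pandas as pd

# Toy stand-in for a fimo.tsv table (hypothetical rows).
hits = pd.DataFrame({
    "sequence_name": ["GENE_1", "GENE_1", "GENE_2"],
    "start":         [200,      1500,     950],
    "p-value":       [1e-4,     5e-6,     3e-4],
})

# The TSS is 1000 bp into each sequence: upstream hits get a positive
# distance, hits after the TSS are clamped to a distance of 0.
hits["distance"] = (1000 - hits["start"]).clip(lower=0)

# Lowest p-value per gene, computed without an explicit loop.
min_p = hits.groupby("sequence_name")["p-value"].min()

print(hits)
print(min_p)
```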

Now, we can call the loop in parallel. Since it may take some time to compute, we provide the precomputed results. To avoid memory issues and reduce the computational burden, FIMO only reports hits below a significance threshold: a p-value of 0.001, or a q-value of 0.05 for the q-value network.

In [None]:
computeNetworks=0
if computeNetworks==1:
    # p-value binary network
    pool = multiprocessing.Pool(36)
    res  = pool.map(partial(parallelFIMO,tfNames=tfNames,geneNames=geneNames,pqval=0), range(nTFs))
    res  = pd.concat(res)
    res.to_csv('regMatPval1e3.csv')

    # q-value binary network
    pool = multiprocessing.Pool(36)
    res  = pool.map(partial(parallelFIMO,tfNames=tfNames,geneNames=geneNames,pqval=1), range(nTFs))
    res  = pd.concat(res)
    res.to_csv('regMatQval005.csv') # q-value threshold of 0.05, same file name as the one loaded below

    # Continuous network 1
    pool = multiprocessing.Pool(36)
    res  = pool.map(partial(parallelFIMO,tfNames=tfNames,geneNames=geneNames,pqval=2), range(nTFs))
    res  = pd.concat(res)
    res.to_csv('regMatCont1.csv')

    # Continuous network 2
    pool = multiprocessing.Pool(36)
    res  = pool.map(partial(parallelFIMO,tfNames=tfNames,geneNames=geneNames,pqval=3), range(nTFs))
    res  = pd.concat(res)
    res.to_csv('regMatCont2.csv')

    # Continuous network 3
    pool = multiprocessing.Pool(36)
    res  = pool.map(partial(parallelFIMO,tfNames=tfNames,geneNames=geneNames,pqval=4), range(nTFs))
    res  = pd.concat(res)
    res.to_csv('regMatCont3.csv')

## Processing the final network
### Binary networks
We computed two binary networks: the first one is an FDR-corrected p-value (q-value) network thresholded at 0.05 and the second is a p-value network thresholded at $10^{-5}$. In other words, if the significance of binding determined by the FIMO scan is less than a certain significance threshold, we will assign a binding event and the edge weight will be set to 1, otherwise the edge will be set to 0.

In [None]:
regmatqval=pd.read_csv('/opt/data/netZooPy/regPrior/regMatQval005.csv',header=0,index_col=0)
tresh=0.05

In [None]:
regmatqval[(regmatqval>0) & (regmatqval <= tresh)]=1
regmatqval[(regmatqval > tresh) & (regmatqval < 1)] =0
plt.hist(regmatqval.values.flatten())

In [None]:
regmatpval=pd.read_csv('/opt/data/netZooPy/regPrior/regMatPval1e3.csv',header=0,index_col=0)
tresh=1e-5

In [None]:
regmatpval[(regmatpval>0) & (regmatpval <= tresh)]=1
regmatpval[(regmatpval > tresh) & (regmatpval < 1)] =0
plt.hist(regmatpval.values.flatten())

### Scaling the continuous networks
We will first start by exploring the edge distribution that each continuous network has and scale them in order to be able to use them for network reconstruction methods.

### 1. Garcia-Alonso model

We start by loading the network that we computed to explore basic statistics.

In [None]:
cont1=pd.read_csv('/opt/data/netZooPy/regPrior/regMatCont1.csv',header=0,index_col=0)

print('the maximum value is ',np.max(cont1.max()))
print('the minimum value is ',np.min(cont1.min()))
print('Are there null values?', cont1.isnull().values.any())
print('The dimensions are:', cont1.shape)
print('Matrix density is', np.count_nonzero(cont1)/(cont1.shape[0]*cont1.shape[1]))

Our network includes 1149 TFs and 38723 genes, the largest edge weight is 868.5 and the lowest is 0. The density of the adjacency matrix is about 91%.

In [None]:
plt.hist(cont1.values.flatten())
plt.yscale('log', nonpositive='clip')
plt.ylabel('Frequency')
plt.xlabel('Edge weight')

We can see that the values are widely spread, therefore we need to quantile normalize them. Then, we need to scale the values between 0 and 1 to match the scale of the other input matrices to netZoo tools, such as the coexpression matrix (between -1 and 1) and the PPI matrix (between 0 and 1).

In [None]:
def quantileNormalize(df_input):
    df = df_input.copy()
    # compute the mean of the sorted columns at each rank
    dic = {}
    for col in df:
        dic.update({col : sorted(df[col])})
    sorted_df = pd.DataFrame(dic)
    rank = sorted_df.mean(axis = 1).tolist()
    # replace each value by the rank mean at its sorted position
    for col in df:
        t = np.searchsorted(np.sort(df[col]), df[col])
        df[col] = [rank[i] for i in t]
    return df
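
To see what the normalization and the scaling do, here is a toy example on a 3x2 matrix. The TF and gene names and the values are made up, and the sketch follows the same `searchsorted` approach as the function above, so tied values would all receive the lowest matching rank mean.

```python
import numpy as np
import pandas as pd

# Toy TF-by-gene matrix (hypothetical values).
toy = pd.DataFrame(
    {"GENE_1": [5.0, 2.0, 3.0], "GENE_2": [4.0, 1.0, 6.0]},
    index=["TF_A", "TF_B", "TF_C"],
)

# Mean of the sorted columns at each rank: [1.5, 3.5, 5.5].
rank_means = np.sort(toy.values, axis=0).mean(axis=1)

normalized = toy.copy()
for col in toy:
    # Position of each value within its own sorted column.
    positions = np.searchsorted(np.sort(toy[col].values), toy[col].values)
    normalized[col] = rank_means[positions]

# Scale to [0, 1] by dividing by the largest value.
scaled = normalized / normalized.values.max()
print(normalized)
print(scaled)
```

After normalization, both columns contain the same set of values, only in a different order.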

In [None]:
cont1qnorm = quantileNormalize(cont1)
cont1qnorm = cont1qnorm/(np.max(cont1qnorm.max()))

In [None]:
plt.hist(cont1qnorm.values.flatten())
plt.yscale('log', nonpositive='clip')
plt.ylabel('Frequency')
plt.xlabel('Edge weight')

### 2. Ouyang model
We do the same for the second model. The edge weights are more dispersed because each distance term is multiplied by a significance term: edge values are between 0 and 4624. Matrix density is 91%, identical to the previous model, because the significance term does not change which TF-gene pairs have a motif hit.

In [None]:
cont2=pd.read_csv('/opt/data/netZooPy/regPrior/regMatCont2.csv',header=0,index_col=0)

print('the maximum value is ',np.max(cont2.max()))
print('the minimum value is ',np.min(cont2.min()))
print('Are there null values?', cont2.isnull().values.any())
print('The dimensions are:', cont2.shape)
print('Matrix density is', np.count_nonzero(cont2)/(cont2.shape[0]*cont2.shape[1]))

In [None]:
plt.hist(cont2.values.flatten());
plt.yscale('log', nonpositive='clip');
plt.ylabel('Frequency');
plt.xlabel('Edge weight');

In [None]:
cont2qnorm = quantileNormalize(cont2)
cont2qnorm = cont2qnorm/(np.max(cont2qnorm.max()))

In [None]:
plt.hist(cont2qnorm.values.flatten());
plt.yscale('log', nonpositive='clip');
plt.ylabel('Frequency');
plt.xlabel('Edge weight');

### 3. RP model
The RP model edge weights vary between 0 and 7513. The weighted adjacency matrix has the same density as the two other matrices (91%). When the motif is exactly at the TSS or after the TSS, the distance $d$ to the TSS was set to 0. In the RP model equation, a distance of 0 gives a contribution of $e^{-0.5}\approx0.6$ per motif.

In [None]:
cont3=pd.read_csv('/opt/data/netZooPy/regPrior/regMatCont3.csv',header=0,index_col=0)

print('the maximum value is ',np.max(cont3.max()))
print('the minimum value is ',np.min(cont3.min()))
print('Are there null values?', cont3.isnull().values.any())
print('The dimensions are:', cont3.shape)
print('Matrix density is', np.count_nonzero(cont3)/(cont3.shape[0]*cont3.shape[1]))

In [None]:
plt.hist(cont3.values.flatten())
plt.yscale('log', nonpositive='clip')
plt.ylabel('Frequency')
plt.xlabel('Edge weight')

In [None]:
cont3qnorm = quantileNormalize(cont3)
cont3qnorm = cont3qnorm/(np.max(cont3qnorm.max()))

In [None]:
plt.hist(cont3qnorm.values.flatten())
plt.yscale('log', nonpositive='clip')
plt.ylabel('Frequency')
plt.xlabel('Edge weight')

## References

1- Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. "FIMO: scanning for occurrences of a given motif." Bioinformatics 27.7 (2011): 1017-1018.

2- Lambert, Samuel A., et al. "The human transcription factors." Cell 172.4 (2018): 650-665.

3- Tang, Qianzi, et al. "A comprehensive view of nuclear receptor cancer cistromes." Cancer research 71.22 (2011): 6940-6947.

4- Jolma, Arttu, et al. "DNA-binding specificities of human transcription factors." Cell 152.1-2 (2013): 327-339.

5- Garcia-Alonso, Luz, et al. "Benchmark and integration of resources for the estimation of human transcription factor activities." Genome research 29.8 (2019): 1363-1375.

6- Ouyang, Zhengqing, Qing Zhou, and Wing Hung Wong. "ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells." Proceedings of the National Academy of Sciences 106.51 (2009): 21521-21526.

7- Tang, Qianzi, et al. "A comprehensive view of nuclear receptor cancer cistromes." Cancer research 71.22 (2011): 6940-6947.

8- Chen, Chen-Hao, et al. "Determinants of transcription factor regulatory range." Nature communications 11.1 (2020): 1-15.