Skip to content
Analyze the neighborhood (upstream/downstream neighbors) of any gene set
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
R
data-raw
data
man
tests
vignettes
.Rbuildignore
.gitignore
.travis.yml
DESCRIPTION
GeneNeighborhood.Rproj
LICENSE
NAMESPACE
README.Rmd
README.md
appveyor.yml
codecov.yml

README.md

Travis build status AppVeyor build status Coverage status

GeneNeighborhood

The goal of GeneNeighborhood is to extract and analyze the orientation and proximity of upstream/downstream genes.

Installation

The development version of the GeneNeighborhood package can be installed from GitHub with:

# install.packages("devtools")
devtools::install_github("pgpmartin/GeneNeighborhood")

Example

Load the library:

library(GeneNeighborhood)

Obtain data on the gene neighbors

To run examples, the package includes a GRanges named Genegr that contains 676 genes with random coordinates on a single chromosome Chr1:

Genegr
#> GRanges object with 676 ranges and 0 metadata columns:
#>      seqnames       ranges strand
#>         <Rle>    <IRanges>  <Rle>
#>   AA     Chr1 [2876, 2884]      +
#>   AB     Chr1 [7884, 7886]      +
#>   AC     Chr1 [4090, 4094]      +
#>   AD     Chr1 [8831, 8835]      +
#>   AE     Chr1 [9405, 9412]      +
#>   ..      ...          ...    ...
#>   ZV     Chr1 [5214, 5220]      +
#>   ZW     Chr1 [ 863,  868]      +
#>   ZX     Chr1 [2831, 2836]      +
#>   ZY     Chr1 [4205, 4205]      -
#>   ZZ     Chr1 [5873, 5873]      -
#>   -------
#>   seqinfo: 1 sequence from mock genome

For each feature/gene, we extract information (orientation and distance, potential overlaps) about their upstream/downstream neighbors with:

GeneNeighbors <- getGeneNeighborhood(Genegr)
#> There are 96 genes (14.2%) that overlap with >1 gene
#> More than 10% of the genes overlap with multiple genes

Analyze the orientation of the genes' neighbors

We define a random set of 100 genes:

set.seed(123)
randGenes <- sample(names(Genegr), 100)

We extract statistics about the orientation of their neighbors using:

## Neighbors Orientation Statistics:
NOS <- analyzeNeighborsOrientation(randGenes, 
                                   GeneNeighborhood = GeneNeighbors)

By default all genes are used as a universe and an enrichment test is performed. By default, the function also analyzes the "other" orientation which may be hard to interpret. We can remove this orientation using:

NOS <- analyzeNeighborsOrientation(randGenes, 
                                   GeneNeighborhood = GeneNeighbors,
                                   keepOther = FALSE)
#> Total number of genes in GeneList: 100 
#> Length of Gene Universe is 676 
#> 
#> Analysis of upstream gene orientation:
#> ======================================
#> 21 genes with 'other' info for their upstream gene are removed
#> Gene set for upstream gene analysis has 79 genes
#> 2 genes from universe have missing data for upstream gene
#> 157 genes from universe with 'other' info for their upstream gene are removed
#> Universe for upstream gene analysis has 517 genes
#> 
#> Analysis of downstream gene orientation:
#> ========================================
#> 21 genes with 'other' info for their downstream gene are removed
#> Gene set for downstream gene analysis has 79 genes
#> 1 genes from universe have missing data for downstream gene
#> 157 genes from universe with 'other' info for their downstream gene are removed
#> Universe for downstream gene analysis has 518 genes

We obtain the following table:

Side Orientation n Percentage n_Universe Percentage_Universe p.value
Upstream OppositeOverlap 4 5.06 35 6.77 0.810
Upstream OppositeStrand 38 48.10 236 45.65 0.360
Upstream SameOverlap 4 5.06 31 6.00 0.730
Upstream SameStrand 33 41.77 215 41.59 0.530
Downstream OppositeOverlap 5 6.33 16 3.09 0.081
Downstream OppositeStrand 40 50.63 248 47.88 0.340
Downstream SameOverlap 4 5.06 31 5.98 0.720
Downstream SameStrand 30 37.97 223 43.05 0.870

We can plot the corresponding percentages using:

plotNeighborsOrientation(NOS)

Analyze the proximity of the genes' neighbors

We can analyze specifically the distances to the upstream genes with (here using 1000 bootstrap replicates to estimate the 95% confidence intervals of the mean and median):

randUpstreamDist <- statDistanceSide(GeneNeighborhood = GeneNeighbors,
                                     glist = randGenes,
                                     Side = "Upstream",
                                     nboot = 1e3)

Which gives the following (simplified) table:

GeneGroup SideClass n Median Mean SD KS.pvalue Wilcox.pvalue Independ.pvalue
GeneList OppositeStrand 38 9.5 14.47 13.81 0.65 0.63 0.52
GeneList SameStrand 33 10.0 13.00 9.80 0.91 0.98 0.51
GeneUniverse OppositeStrand 236 10.0 13.30 12.43 NA NA NA
GeneUniverse SameStrand 215 11.0 14.47 13.80 NA NA NA

Or we directly analyze the distances to both upstream and downstream genes:

randDist <- analyzeNeighborsDistance(GeneList = randGenes,
                                     GeneNeighborhood = GeneNeighbors,
                                     nboot = 1e3)

Which gives the following (simplified) table:

GeneGroup Side Orientation n Median Mean SD KS.pvalue Wilcox.pvalue Independ.pvalue
GeneList Upstream OppositeStrand 38 9.5 14.47 13.81 0.65 0.63 0.520
GeneList Upstream SameStrand 33 10.0 13.00 9.80 0.91 0.98 0.510
GeneUniverse Upstream OppositeStrand 236 10.0 13.30 12.43 NA NA NA
GeneUniverse Upstream SameStrand 215 11.0 14.47 13.80 NA NA NA
GeneList Downstream SameStrand 30 15.5 18.70 15.54 0.29 0.10 0.098
GeneList Downstream OppositeStrand 40 7.0 15.20 18.69 0.62 0.50 0.810
GeneUniverse Downstream OppositeStrand 248 10.0 14.65 15.66 NA NA NA
GeneUniverse Downstream SameStrand 223 12.0 14.85 13.69 NA NA NA

The alldist$stats table also contains 95% bootstrap confidence intervals for the mean and median of the GeneList set:

GeneGroup Side Orientation n Median_lowerCI Median Median_upperCI Mean_lowerCI Mean Mean_upperCI
GeneList Upstream OppositeStrand 38 7.0 9.5 14.5 11.00 14.47 20.09
GeneList Upstream SameStrand 33 5.0 10.0 10.0 10.21 13.00 16.91
GeneList Downstream SameStrand 30 8.5 15.5 22.5 14.19 18.70 25.32
GeneList Downstream OppositeStrand 40 4.0 7.0 16.0 9.94 15.20 22.12

The analyzeNeighborsDistance function can also be used to extract intergenic distances for all genes (except those with overlapping genes):

alldist <- analyzeNeighborsDistance(GeneList = names(Genegr),
                                    GeneNeighborhood = GeneNeighbors,
                                    DistriTest = FALSE,
                                    nboot = 1e3)

We can use these distances to preferentially select genes with a short upstream distance. First, we extract upstream distances:

updist <- alldist$distances$Distance[alldist$distances$Side=="Upstream"]
names(updist) <- alldist$distances$GeneName[alldist$distances$Side=="Upstream"]

Then we define a probability of selecting the gene that is inversely proportional to its upstream distance:

probs <- (max(updist) - updist) / sum(max(updist) - updist)

Then select 100 genes using these probabilities:

set.seed(1234)
lessRandGenes <- sample(names(updist), 100, prob=probs)

And finally analyze the intergenic distances with these genes' neighbors:

lessRandDist <- analyzeNeighborsDistance(GeneList = lessRandGenes,
                                         GeneNeighborhood = GeneNeighbors)
#> Warning in norm.inter(t, adj.alpha): extreme order statistics used as
#> endpoints

We obtain the following (simplified) table:

GeneGroup Side Orientation n Median Mean SD KS.pvalue Wilcox.pvalue Independ.pvalue
GeneList Upstream SameStrand 50 5.0 8.52 9.05 0.00038 8.9e-05 0.00051
GeneList Upstream OppositeStrand 50 9.0 9.92 9.12 0.26000 4.8e-02 0.03100
GeneUniverse Upstream OppositeStrand 236 10.0 13.30 12.43 NA NA NA
GeneUniverse Upstream SameStrand 215 11.0 14.47 13.80 NA NA NA
GeneList Downstream OppositeStrand 39 8.0 14.00 17.84 0.30000 3.3e-01 0.78000
GeneList Downstream SameStrand 50 12.5 16.18 13.35 0.94000 3.2e-01 0.43000
GeneUniverse Downstream OppositeStrand 248 10.0 14.65 15.66 NA NA NA
GeneUniverse Downstream SameStrand 223 12.0 14.85 13.69 NA NA NA

Before plotting, we need to assemble the distances for different groups of genes in a single data frame:

mydist <- rbind.data.frame(alldist$distances,
                           randDist$distances,
                           lessRandDist$distances)
mydist$GeneSet <- rep(c("All Genes", "Random Genes", "Less Random genes"),
                      times = c(nrow(alldist$distances), 
                                nrow(randDist$distances), 
                                nrow(lessRandDist$distances)))

Now, we can plot the distribution of intergenic distances for these sets of genes:

plotDistanceDistrib(mydist)

Metagene profiles

Another way to study the neighborhood of a set of genes is to produce an average profiles representing the coverage of annotations in regions surrounding this set of genes. Such representation provides information on both the orientation and the distance of the neighbors. It also allows to compare different groups of genes (e.g. upregulated genes, genes with a ChIP-seq peak or a specific transcription factor motif in their promoter, etc...).

First we extract the profiles of annotations around (+/-50bp) all our "mock" genes. We use 3 bins to summarize the coverage on the gene body so genes of size <3bp are removed. We also use the argument usePercent=TRUE so that the profiles only indicate the presence/absence (0/1) of an annotation at a position rather than the number of genes that cover this position.

usePercent = TRUE
Prof <- annotationCoverageAroundFeatures(Genegr,
                                         sidedist = 50,
                                         usePercent = usePercent,
                                         nbins=3)
#> 4 windows exceeding chromosome borders are removed
#> Features are binned using binFeatureProfiles
#> Removing 136 features of size lower than 3bp
#> Bin size is: 2.13 +/- 0.74bp (mean +/- sd)

The object that is created by this function contains the following elements:

#> Feature_Sense
#> Feature_Antisense
#> UpstreamBorder_Sense
#> UpstreamBorder_Antisense
#> DownstreamBorder_Sense
#> DownstreamBorder_Antisense

They represent the strand-specific coverage of annotations on the gene body (Feature), in the region (-50bp to the start of the gene) located upstream of the genes (UpstreamBorder) and in the region (end of the gene to +50bp) located downstream of the genes (DownstreamBorder). The Sense strand is the strand on which the focus gene (Feature) is annotated and the Antisense strand is the opposite strand. Because we used usePercent=TRUE, the Feature_Sense profiles will only contain 1s, indicating the presence of annotation(s) all along the body of each focus gene (i.e. annotation of the focus gene itself). When usePercent=FALSE, values >1 can occur in these profile, when the focus gene overlaps with other genes annotated on the same strand.

Then, we assemble these different elements in order to produce a vector for each focus gene containing the upstream region (-50bp to the TSS), the gene body itself and the downstream region (TES to +50bp).

Prof <- assembleProfiles(Prof)

Now we define a group of genes that have neighbors, on the same strand, at a short distance.

#Get distances to the closest gene on the same strand:
Dist2Nearest <- mcols(distanceToNearest(Genegr))$distance
CloseTandemNeighbors <- names(Genegr)[Dist2Nearest<=8]

We assemble the different groups of genes that we have defined so far in a list:

GeneGroups <- list(All = names(Genegr),
                   Random = randGenes,
                   CloseNeighbors = CloseTandemNeighbors)

Then, for each group of genes, we calculate the average profile and its 95% confidence interval:

avgProf <- list()
for (i in 1:length(GeneGroups)) {
  avgProf[[i]] <- list()
  avgProf[[i]]$sense <- getAvgProfileWithCI(Prof$Profiles_Sense,
                                            selFeatures = GeneGroups[[i]],
                                            pos = c(-50:0, 1:3, 0:50))
  avgProf[[i]]$antisense <- getAvgProfileWithCI(Prof$Profiles_Antisense,
                                                selFeatures = GeneGroups[[i]],
                                                pos = c(-50:0, 1:3, 0:50))
}
names(avgProf) <- names(GeneGroups)

Before plotting, we need to assemble these profiles in a single table and to provide the x-coordinates for these "metagene" profiles (in the interval [0,5] with the gene body occupying coordinates ]2,3[).

#Define the x-coordinates (the 3 sequences correspond to "Upstream", "GeneBody" and "Downstream")
xcoord = c(seq(0, 2, length.out = 51),
           seq(2, 3, length.out = 5)[2:4],
           seq(3, 5, length.out = 51))
#Assemble the metagene profiles
avgProf_df <- reshape2::melt(avgProf,
                             measure.vars = "Profile", value.name = "Profile") %>%
                  dplyr::rename("Strand" = "L2",
                                "GeneSet" = "L1") %>%
                  dplyr::mutate(Xcoord=rep(xcoord, 2*length(GeneGroups)))

Now we can plot these profiles using:

plotMetageneAnnotProfile(avgProf_df)

Distance to single-point features

One limitation of these average profiles is that all the length of the genes/features are taken into account to produce the profiles. Sometimes, we are interested only in one part of this information. For example, we may want to know how many genes (or what fraction of genes) have a TSS, on the same strand, at less than 20bp from their end? The metagene profiles above can provide an approximation but not the true value because some features actually start and end within this interval. The TSSs are single-point features. We can produce a coverage of the TSS using:

#First obtain the coordinates of the TSS:
tss <- GenomicRanges::promoters(Genegr, upstream = 0, downstream = 1)
#Then the coverage of TSSs around (+/- 50bp) the genes
tsscov <- annotationCoverageAroundFeatures(annot = tss, 
                                           features = Genegr, 
                                           sidedist = 50, 
                                           usePercent = TRUE,
                                           nbins = 3)
#> 4 windows exceeding chromosome borders are removed
#> Features are binned using binFeatureProfiles
#> Removing 136 features of size lower than 3bp
#> Bin size is: 2.13 +/- 0.74bp (mean +/- sd)
#> Removing 136 features of size lower than 3bp
#> Bin size is: 2.13 +/- 0.74bp (mean +/- sd)

But metagene plots as above will not work well because the data is too sparse. Instead, once a TSS is found at, say 5bp downstream of a gene border, we would like to extend its "presence" value (1) to any distance larger than 5bp to indicate that this specific gene has a TSS at less than any distance larger than 5. We do this using:

extTSScov <- extendPointPresence(tsscov, sidedist = 50)

Now for different groups of genes we can obtain the cumulative percentage of genes that have a TSS at less than a given distance with:

CP <- getCumulPercentProfiles(extTSScov,
                              genesets = GeneGroups)

And plot the results with:

plotCumulPercentProfile(CP)

You can’t perform that action at this time.