Skip to content

Commit

Permalink
Merge c800cc8 into 7b80c9c
Browse files Browse the repository at this point in the history
  • Loading branch information
yang-yangfeng committed Jan 11, 2020
2 parents 7b80c9c + c800cc8 commit 4a0c2c4
Show file tree
Hide file tree
Showing 2 changed files with 74 additions and 39 deletions.
27 changes: 14 additions & 13 deletions docs/workflow.md
@@ -1,6 +1,6 @@
# RegTools example workflow

This is an example workflow for running RegTools on a cohort of samples. This analysis requires that there be a vcf and RNA bam/cram file for each samples. The outline described below was used to run our own analysis on TCGA data.
This is an example workflow for running RegTools on a cohort of samples. This analysis requires that there be a VCF and RNA bam file for each sample. The workflow described below was used to run our own analysis on TCGA data.

By the end of the analysis, the directory structure should look like the example below. The `*` in the example below refers to the tag/parameter used to run `regtools cis-splice-effects identify` with.

Expand Down Expand Up @@ -40,65 +40,66 @@ By the end of the analysis, the directory structure should look like the example
- junction_pvalues_*.tsv
```

### Set tag and parameter shell arguments
## RegTools commands

**Set tag and parameter shell arguments**

```bash
tag=<tag>
param=<run option>
# (e.g. tag=default param=""; tag=E param="-E"; tag=i20e5 param="-i 20 -e 5")
```

### Run `regtools cis-splice-effects identify` with desired options for selecting variant and window size
**Run `regtools cis-splice-effects identify` with desired options for selecting variant and window size**

```bash
for i in samples/*/; do regtools cis-splice-effects identify $param -o ${i}/output/cse_identify_filtered_$tag.tsv -j ${i}/output/cse_identify_filtered_$tag.bed -v ${i}/output/cse_identify_filtered_$tag.vcf ${i}/variants.per_gene.vep.vcf.gz ${i}/tumor_rna_alignments.bam /reference.fa reference.gtf; done
```

### Make `variant.bed` for each sample
**Make `variant.bed` for each sample**

```bash
for i in samples/*/; do bash variants.sh ${i}/output/cse_identify_filtered_$tag.tsv ${i}/output/variants_$tag.bed; done
```

### Combine each sample's `variant.bed` file per tag to get all variants that were deemed significant to splicing across all samples for a given tag
**Combine each sample's `variant.bed` file per tag to get all variants that were deemed significant to splicing across all samples for a given tag**

```bash
echo -e 'chrom\tstart\tend\tsamples' > all_splicing_variants_$tag.bed
for i in samples/*/; do j=${i##samples/}; uniq ${i}output/variants_$tag.bed | awk -v var=${j%%/} '{print $0 "\t" var}' >> all_splicing_variants_$tag.bed; done
```

### Make vcf of all variants across all samples (from each sample's variants.vcf). Then, compress it and index it
**Make vcf of all variants across all samples (from each sample's variants.vcf). Then, compress it and index it.**

```bash
vcf-concat samples/*/variants.vcf.gz | vcf-sort > all_variants_sorted.vcf

###### optional ######
bgzip all_variants_sorted.vcf

tabix all_variants_sorted.vcf.gz
```

### Run `regtools cis-splice effects identify` on all samples with all variants (with `$tag` options as example)
**Run `regtools cis-splice effects identify` on all samples with all variants (with `$tag` options as example)**

```bash
for i in samples/*/; do bsub -oo $i/logs/regtools_compare_$tag.lsf regtools cis-splice-effects identify $param -o ${i}/output/cse_identify_filtered_compare_$tag.tsv -j ${i}/output/cse_identify_filtered_compare_$tag.bed -v ${i}/output/cse_identify_filtered_compare_$tag.vcf all_variants_sorted.vcf.gz ${i}/tumor_rna_alignments.bam reference.fa reference.gtf; done
```

## Beginning of compare junctions analysis
## Statistical analysis

### Make directory to store comparison results
**Make directory to store comparison results**

```bash
mkdir -p compare_junctions/hist
```

### Run `compare_junctions_hist.R` on sample data
**Run `stats_wrapper.py` on sample data**

```bash
Rscript --vanilla compare_junctions_hist.R <tag>
python3 stats_wrapper.py <tag>
```

### Run `filter_and_BH.R` to adjust p values and filter results
**Run `filter_and_BH.R` to adjust p values and filter results**

```bash
Rscript --vanilla filter_and_BH.R <tag>
Expand Down
86 changes: 60 additions & 26 deletions scripts/compare_junctions_hist_v2.R
Expand Up @@ -9,21 +9,21 @@ library(tidyverse)

debug = F

# system.time({
# if (debug){
# tag = paste("_", "default", sep="")
# } else {
# # get options tag
# args = commandArgs(trailingOnly = TRUE)
# tag = args[1]
# input_file = args[2]
# if ( substr(tag, 2, 3) == "--"){
# stop("Please specify an option tag (e.g. \"default\", \"i20e5\")")
# }
# }

tag = 'E'
input_file = 'all_splicing_variants_E.bed'
system.time({
if (debug){
tag = paste("_", "default", sep="")
} else {
# get options tag
args = commandArgs(trailingOnly = TRUE)
tag = args[1]
input_file = args[2]
if ( substr(tag, 2, 3) == "--"){
stop("Please specify an option tag (e.g. \"default\", \"i20e5\")")
}
}

# tag = 'E'
# input_file = '/Users/kcotto/Desktop/CHOL/all_splicing_variants_E.bed'

# All splicing relevant variants (union of rows from variants.bed files; add column with comma-separated list of sample names)
all_splicing_variants = unique(data.table::fread(input_file), sep = '\t', header = T, stringsAsFactors = FALSE)
Expand Down Expand Up @@ -187,14 +187,14 @@ get_mean <- function(x){
return(x)
}

x <- mapply(get_mean, regtools_data$norm_scores_non)
regtools_data$mean_norm_score_non <- x

get_sd <- function(x){
x <- sd(as.numeric(x))
return(x)
}

x <- mapply(get_mean, regtools_data$norm_scores_non)
regtools_data$mean_norm_score_non <- x

x <- mapply(get_sd, regtools_data$norm_scores_non)
regtools_data$sd_norm_score_non <- x

Expand Down Expand Up @@ -222,21 +222,54 @@ regtools_data$mean_norm_score_variant <- x
x <- mapply(get_sd, regtools_data$norm_scores_variant)
regtools_data$sd_norm_score_variant <- x

get_min <- function(x){
x <- min(as.numeric(x))
return(x)
}

x <- mapply(get_min, regtools_data$norm_scores_variant)
regtools_data$min_norm_score_variant <- x

print("test7")

################ calculate p-values ############################################

# calculate using mean
a <- function(x){
variant_norm_score = as.numeric(unlist(strsplit(x[['norm_scores_variant']], ',', fixed=TRUE)))
variant_norm_score = mean(as.numeric(unlist(strsplit(x[['norm_scores_variant']], ',', fixed=TRUE))))
if(length(x[['norm_scores_non']]) <= 1){
return(0)
}

all_norm_scores = c(x$norm_scores_non, variant_norm_score)
countable = rank(all_norm_scores)
num_samples = str_count(x$norm_scores_variant, ',') + 1
non_variant_norm_scores_ranked = head(countable, (-1 * num_samples))
variant_norm_score_ranked = tail(countable, num_samples)
non_variant_norm_scores_ranked = head(countable, -1)
variant_norm_score_ranked = tail(countable, 1)
histinfo = hist(non_variant_norm_scores_ranked,
breaks = seq(0.5, max(non_variant_norm_scores_ranked)+1.5, by=1), plot=F)
mids = histinfo$mids
cd = cumsum(histinfo$density)
underestimate = max(which(mids <= variant_norm_score_ranked))
pvalue = 1-cd[underestimate]
return(pvalue)
}

# calculate using min
b <- function(x){
variant_norm_score = min(as.numeric(unlist(strsplit(x[['norm_scores_variant']], ',', fixed=TRUE))))
if(variant_norm_score == 0){
return(1)
}
if(length(x[['norm_scores_non']]) <= 1){
return(0)
}

all_norm_scores = c(x$norm_scores_non, variant_norm_score)
countable = rank(all_norm_scores)
num_samples = str_count(x$norm_scores_variant, ',') + 1
non_variant_norm_scores_ranked = head(countable, -1)
variant_norm_score_ranked = tail(countable, 1)
histinfo = hist(non_variant_norm_scores_ranked,
breaks = seq(0.5, max(non_variant_norm_scores_ranked)+1.5, by=1), plot=F)
mids = histinfo$mids
Expand All @@ -246,7 +279,8 @@ a <- function(x){
return(pvalue)
}

regtools_data$p_value <- apply(regtools_data, 1, a)
regtools_data$p_value_mean <- apply(regtools_data, 1, a)
regtools_data$p_value_min <- apply(regtools_data, 1, b)
print("Number of rows in data.table")
print(length(regtools_data$samples))

Expand All @@ -258,12 +292,12 @@ regtools_data$norm_scores_non <- unlist(lapply(regtools_data$norm_scores_non,pas
columns_to_keep = c('samples', 'variant_info.x', 'genes', 'sample', "chrom.x", "start.x", "end.x", 'strand.x', 'anchor.x', 'info',
'names', 'mean_norm_score_variant', 'sd_norm_score_variant', 'norm_scores_variant',
'total_score_variant', 'mean_norm_score_non', 'sd_norm_score_non', 'norm_scores_non',
'total_score_non', 'p_value')
'total_score_non', 'p_value_mean','p_value_min')
regtools_data = subset(regtools_data, select=columns_to_keep)
colnames(regtools_data) <- c('variant_samples', 'variant_info', 'genes', 'junction_samples', "chrom", "start", "end", 'strand', 'anchor', 'variant_junction_info',
'names', 'mean_norm_score_variant', 'sd_norm_score_variant', 'norm_scores_variant',
'total_score_variant', 'mean_norm_score_non', 'sd_norm_score_non', 'norm_scores_non',
'total_score_non', 'pvalue')
'total_score_non', 'p_value_mean','p_value_min')
regtools_data$sd_norm_score_variant[is.na(regtools_data$sd_norm_score_variant)] = 0
regtools_data$mean_norm_score_non[is.na(regtools_data$mean_norm_score_non)] = 0
regtools_data$sd_norm_score_non[is.na(regtools_data$sd_norm_score_non)] = 0
Expand All @@ -274,5 +308,5 @@ regtools_data = regtools_data %>% distinct()


write.table(regtools_data, file=paste(input_file, "_out.tsv", sep=""), quote=FALSE, sep='\t', row.names = F)
#
# })

})

0 comments on commit 4a0c2c4

Please sign in to comment.