Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Min and average #133

Merged
merged 4 commits into from
Jan 12, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
27 changes: 14 additions & 13 deletions docs/workflow.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# RegTools example workflow

This is an example workflow for running RegTools on a cohort of samples. This analysis requires that there be a vcf and RNA bam/cram file for each samples. The outline described below was used to run our own analysis on TCGA data.
This is an example workflow for running RegTools on a cohort of samples. This analysis requires that there be a VCF and RNA bam file for each sample. The workflow described below was used to run our own analysis on TCGA data.

By the end of the analysis, the directory structure should look like the example below. The `*` in the example below refers to the tag/parameter used to run `regtools cis-splice-effects identify` with.

Expand Down Expand Up @@ -40,65 +40,66 @@ By the end of the analysis, the directory structure should look like the example
- junction_pvalues_*.tsv
```

### Set tag and parameter shell arguments
## RegTools commands

**Set tag and parameter shell arguments**

```bash
tag=<tag>
param=<run option>
# (e.g. tag=default param=""; tag=E param="-E"; tag=i20e5 param="-i 20 -e 5")
```

### Run `regtools cis-splice-effects identify` with desired options for selecting variant and window size
**Run `regtools cis-splice-effects identify` with desired options for selecting variant and window size**

```bash
for i in samples/*/; do regtools cis-splice-effects identify $param -o ${i}/output/cse_identify_filtered_$tag.tsv -j ${i}/output/cse_identify_filtered_$tag.bed -v ${i}/output/cse_identify_filtered_$tag.vcf ${i}/variants.per_gene.vep.vcf.gz ${i}/tumor_rna_alignments.bam /reference.fa reference.gtf; done
```

### Make `variant.bed` for each sample
**Make `variant.bed` for each sample**

```bash
for i in samples/*/; do bash variants.sh ${i}/output/cse_identify_filtered_$tag.tsv ${i}/output/variants_$tag.bed; done
```

### Combine each sample's `variant.bed` file per tag to get all variants that were deemed significant to splicing across all samples for a given tag
**Combine each sample's `variant.bed` file per tag to get all variants that were deemed significant to splicing across all samples for a given tag**

```bash
echo -e 'chrom\tstart\tend\tsamples' > all_splicing_variants_$tag.bed
for i in samples/*/; do j=${i##samples/}; uniq ${i}output/variants_$tag.bed | awk -v var=${j%%/} '{print $0 "\t" var}' >> all_splicing_variants_$tag.bed; done
```

### Make vcf of all variants across all samples (from each sample's variants.vcf). Then, compress it and index it
**Make vcf of all variants across all samples (from each sample's variants.vcf). Then, compress it and index it.**

```bash
vcf-concat samples/*/variants.vcf.gz | vcf-sort > all_variants_sorted.vcf

###### optional ######
bgzip all_variants_sorted.vcf

tabix all_variants_sorted.vcf.gz
```

### Run `regtools cis-splice effects identify` on all samples with all variants (with `$tag` options as example)
**Run `regtools cis-splice effects identify` on all samples with all variants (with `$tag` options as example)**

```bash
for i in samples/*/; do bsub -oo $i/logs/regtools_compare_$tag.lsf regtools cis-splice-effects identify $param -o ${i}/output/cse_identify_filtered_compare_$tag.tsv -j ${i}/output/cse_identify_filtered_compare_$tag.bed -v ${i}/output/cse_identify_filtered_compare_$tag.vcf all_variants_sorted.vcf.gz ${i}/tumor_rna_alignments.bam reference.fa reference.gtf; done
```

## Beginning of compare junctions analysis
## Statistical analysis

### Make directory to store comparison results
**Make directory to store comparison results**

```bash
mkdir -p compare_junctions/hist
```

### Run `compare_junctions_hist.R` on sample data
**Run `stats_wrapper.py` on sample data**

```bash
Rscript --vanilla compare_junctions_hist.R <tag>
python3 stats_wrapper.py <tag>
```

### Run `filter_and_BH.R` to adjust p values and filter results
**Run `filter_and_BH.R` to adjust p values and filter results**

```bash
Rscript --vanilla filter_and_BH.R <tag>
Expand Down
86 changes: 60 additions & 26 deletions scripts/compare_junctions_hist_v2.R
Original file line number Diff line number Diff line change
Expand Up @@ -9,21 +9,21 @@ library(tidyverse)

debug = F

# system.time({
# if (debug){
# tag = paste("_", "default", sep="")
# } else {
# # get options tag
# args = commandArgs(trailingOnly = TRUE)
# tag = args[1]
# input_file = args[2]
# if ( substr(tag, 2, 3) == "--"){
# stop("Please specify an option tag (e.g. \"default\", \"i20e5\")")
# }
# }

tag = 'E'
input_file = 'all_splicing_variants_E.bed'
system.time({
if (debug){
tag = paste("_", "default", sep="")
} else {
# get options tag
args = commandArgs(trailingOnly = TRUE)
tag = args[1]
input_file = args[2]
if ( substr(tag, 2, 3) == "--"){
stop("Please specify an option tag (e.g. \"default\", \"i20e5\")")
}
}

# tag = 'E'
# input_file = '/Users/kcotto/Desktop/CHOL/all_splicing_variants_E.bed'

# All splicing relevant variants (union of rows from variants.bed files; add column with comma-separated list of sample names)
all_splicing_variants = unique(data.table::fread(input_file), sep = '\t', header = T, stringsAsFactors = FALSE)
Expand Down Expand Up @@ -187,14 +187,14 @@ get_mean <- function(x){
return(x)
}

x <- mapply(get_mean, regtools_data$norm_scores_non)
regtools_data$mean_norm_score_non <- x

get_sd <- function(x){
x <- sd(as.numeric(x))
return(x)
}

x <- mapply(get_mean, regtools_data$norm_scores_non)
regtools_data$mean_norm_score_non <- x

x <- mapply(get_sd, regtools_data$norm_scores_non)
regtools_data$sd_norm_score_non <- x

Expand Down Expand Up @@ -222,21 +222,54 @@ regtools_data$mean_norm_score_variant <- x
x <- mapply(get_sd, regtools_data$norm_scores_variant)
regtools_data$sd_norm_score_variant <- x

get_min <- function(x){
x <- min(as.numeric(x))
return(x)
}

x <- mapply(get_min, regtools_data$norm_scores_variant)
regtools_data$min_norm_score_variant <- x

print("test7")

################ calculate p-values ############################################

# calculate using mean
a <- function(x){
variant_norm_score = as.numeric(unlist(strsplit(x[['norm_scores_variant']], ',', fixed=TRUE)))
variant_norm_score = mean(as.numeric(unlist(strsplit(x[['norm_scores_variant']], ',', fixed=TRUE))))
if(length(x[['norm_scores_non']]) <= 1){
return(0)
}

all_norm_scores = c(x$norm_scores_non, variant_norm_score)
countable = rank(all_norm_scores)
num_samples = str_count(x$norm_scores_variant, ',') + 1
non_variant_norm_scores_ranked = head(countable, (-1 * num_samples))
variant_norm_score_ranked = tail(countable, num_samples)
non_variant_norm_scores_ranked = head(countable, -1)
variant_norm_score_ranked = tail(countable, 1)
histinfo = hist(non_variant_norm_scores_ranked,
breaks = seq(0.5, max(non_variant_norm_scores_ranked)+1.5, by=1), plot=F)
mids = histinfo$mids
cd = cumsum(histinfo$density)
underestimate = max(which(mids <= variant_norm_score_ranked))
pvalue = 1-cd[underestimate]
return(pvalue)
}

# calculate using min
b <- function(x){
variant_norm_score = min(as.numeric(unlist(strsplit(x[['norm_scores_variant']], ',', fixed=TRUE))))
if(variant_norm_score == 0){
return(1)
}
if(length(x[['norm_scores_non']]) <= 1){
return(0)
}

all_norm_scores = c(x$norm_scores_non, variant_norm_score)
countable = rank(all_norm_scores)
num_samples = str_count(x$norm_scores_variant, ',') + 1
non_variant_norm_scores_ranked = head(countable, -1)
variant_norm_score_ranked = tail(countable, 1)
histinfo = hist(non_variant_norm_scores_ranked,
breaks = seq(0.5, max(non_variant_norm_scores_ranked)+1.5, by=1), plot=F)
mids = histinfo$mids
Expand All @@ -246,7 +279,8 @@ a <- function(x){
return(pvalue)
}

regtools_data$p_value <- apply(regtools_data, 1, a)
regtools_data$p_value_mean <- apply(regtools_data, 1, a)
regtools_data$p_value_min <- apply(regtools_data, 1, b)
print("Number of rows in data.table")
print(length(regtools_data$samples))

Expand All @@ -258,12 +292,12 @@ regtools_data$norm_scores_non <- unlist(lapply(regtools_data$norm_scores_non,pas
columns_to_keep = c('samples', 'variant_info.x', 'genes', 'sample', "chrom.x", "start.x", "end.x", 'strand.x', 'anchor.x', 'info',
'names', 'mean_norm_score_variant', 'sd_norm_score_variant', 'norm_scores_variant',
'total_score_variant', 'mean_norm_score_non', 'sd_norm_score_non', 'norm_scores_non',
'total_score_non', 'p_value')
'total_score_non', 'p_value_mean','p_value_min')
regtools_data = subset(regtools_data, select=columns_to_keep)
colnames(regtools_data) <- c('variant_samples', 'variant_info', 'genes', 'junction_samples', "chrom", "start", "end", 'strand', 'anchor', 'variant_junction_info',
'names', 'mean_norm_score_variant', 'sd_norm_score_variant', 'norm_scores_variant',
'total_score_variant', 'mean_norm_score_non', 'sd_norm_score_non', 'norm_scores_non',
'total_score_non', 'pvalue')
'total_score_non', 'p_value_mean','p_value_min')
regtools_data$sd_norm_score_variant[is.na(regtools_data$sd_norm_score_variant)] = 0
regtools_data$mean_norm_score_non[is.na(regtools_data$mean_norm_score_non)] = 0
regtools_data$sd_norm_score_non[is.na(regtools_data$sd_norm_score_non)] = 0
Expand All @@ -274,5 +308,5 @@ regtools_data = regtools_data %>% distinct()


write.table(regtools_data, file=paste(input_file, "_out.tsv", sep=""), quote=FALSE, sep='\t', row.names = F)
#
# })

})