### Generate count files 

LS180 SOX9 OE RNA-seq data

In [None]:
%%bash 
nohup bash salmon.sh ./ 24 > salmon.out&

In [36]:
cat salmon.sh

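# Usage: bash salmon.sh <project_dir> <threads>
PDIR=$1
JOBS=$2
FASTQDIR='fastq'
INDEX='/rumi/shams/abe/genomes/hg38/gencode.v34/salmon_index/'

cd "$PDIR" || exit 1
mkdir -p ./quants/

for fq1 in "$FASTQDIR"/*R1*.fastq.gz; do
    # Sample name: file name without the _R1_001.fastq.gz suffix
    samp=$(basename "${fq1}")
    samp=${samp/_R1_001.fastq.gz/}
    # Mate file: swap _R1_ for _R2_ rather than a bare R1, which could also occur in a sample name
    fq2=${fq1/_R1_/_R2_}
    echo "Processing sample ${samp}"
    cmd="salmon quant -i $INDEX -l A -1 $fq1 -2 $fq2 -p $JOBS --validateMappings -o ./quants/$samp"
    echo $cmd
    # Keep Salmon's own output in a per-sample log
    $cmd &> "./quants/${samp}.log"
    echo DONE@ $(date)
done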


In [21]:
cat salmon.out

Processing sample GFPN1
salmon quant -i /rumi/shams/abe/genomes/hg38/gencode.v34/salmon_index/ -l A -1 fastq/GFPN1_R1_001.fastq.gz -2 fastq/GFPN1_R2_001.fastq.gz -p 24 --validateMappings -o ./quants/GFPN1
DONE@ Thu Nov 4 05:21:07 UTC 2021
Processing sample GFPN2
salmon quant -i /rumi/shams/abe/genomes/hg38/gencode.v34/salmon_index/ -l A -1 fastq/GFPN2_R1_001.fastq.gz -2 fastq/GFPN2_R2_001.fastq.gz -p 24 --validateMappings -o ./quants/GFPN2
DONE@ Thu Nov 4 05:23:15 UTC 2021
Processing sample GFPP1
salmon quant -i /rumi/shams/abe/genomes/hg38/gencode.v34/salmon_index/ -l A -1 fastq/GFPP1_R1_001.fastq.gz -2 fastq/GFPP1_R2_001.fastq.gz -p 24 --validateMappings -o ./quants/GFPP1
DONE@ Thu Nov 4 05:25:32 UTC 2021
Processing sample GFPP2
salmon quant -i /rumi/shams/abe/genomes/hg38/gencode.v34/salmon_index/ -l A -1 fastq/GFPP2_R1_001.fastq.gz -2 fastq/GFPP2_R2_001.fastq.gz -p 24 --validateMappings -o ./quants/GFPP2
DONE@ Thu Nov 4 05:27:51 UTC 2021
Processing sample M303N1
salmon
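
Since the log above is cut off, it can help to confirm that every sample finished before importing anything into R. The following is a minimal sketch, not part of the original notebook: it assumes the `fastq/` and `quants/` layout used by `salmon.sh` and relies on Salmon writing a `quant.sf` file into each output directory. The helper names `sample_name` and `missing_quants` are hypothetical.

```python
from pathlib import Path

R1_SUFFIX = "_R1_001.fastq.gz"  # suffix that salmon.sh strips from the file name


def sample_name(fastq_r1):
    """Mirror salmon.sh: take the basename and drop the _R1_001.fastq.gz suffix."""
    name = Path(fastq_r1).name
    return name[: -len(R1_SUFFIX)] if name.endswith(R1_SUFFIX) else name


def missing_quants(fastq_dir, quant_dir):
    """Return sample names that have an R1 FASTQ but no quant.sf yet."""
    samples = sorted(sample_name(p) for p in Path(fastq_dir).glob("*R1*.fastq.gz"))
    return [s for s in samples if not (Path(quant_dir) / s / "quant.sf").is_file()]


if __name__ == "__main__":
    # An empty list means every sample has a quantification file.
    print(missing_quants("fastq", "quants"))
```

An empty list only shows that the files exist; the per-sample logs in `./quants/` are still the place to look for Salmon warnings.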

In [2]:
%load_ext rpy2.ipython

In [14]:
%%R 
suppressMessages(suppressWarnings(library(IsoformSwitchAnalyzeR)))
suppressMessages(suppressWarnings(library(GenomicFeatures)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(BiocParallel)))
suppressMessages(suppressWarnings(library(ggplot2)))
suppressMessages(suppressWarnings(library(ggrepel)))
suppressMessages(suppressWarnings(library(patchwork)))

> ### _IsoformSwitchAnalyzeR_
> Enabling Identification and Analysis of Isoform Switches with Functional Consequences and the Associated Alternative Splicing
> https://bioconductor.org/packages/devel/bioc/vignettes/IsoformSwitchAnalyzeR/inst/doc/IsoformSwitchAnalyzeR.html

In [18]:
%%R 
packageVersion('IsoformSwitchAnalyzeR')

[1] ‘1.10.0’


> ### _Importing the Data_
The first step is to import all data needed for the analysis into R and then combine them into a `switchAnalyzeRlist` object.

<!-- > #### _Importing Data from Salmon via Tximeta_
The following approach uses `tximeta` ([Love et al 2020](https://bioconductor.org/packages/devel/bioc/vignettes/tximeta/inst/doc/tximeta.html)) to import Salmon quantification into R. The nice thing about `tximeta` is that it automatically identifies which transcriptome was quantified (if quantified with Salmon >= 1.1.0). If the quantified transcriptome is one of the main databases and model organisms that `tximeta` supports, the associated annotation (GTF and FASTA file) can be imported automatically, making the creation of the `switchAnalyzeRlist` smoother.

> This approach has two steps:
> - Step 1) Create a `data.frame` indicating which quantification files to import as well as annotating which files belong to which conditions.
> - Step 2) Create the switchAnalyzeRlist from the data.frame produced in step 1.

> __Step 1__ The `data.frame` described above can either be created with the built-in `prepareSalmonFileDataFrame()` wrapper or created manually (see the documentation of `prepareSalmonFileDataFrame()` for specifications). You supply the path to a directory, and the function finds the Salmon quantification files (each located in a separate sub-directory) and creates the necessary `data.frame`. It is used as follows:

 -->

In [21]:
%%R 
# Import the Salmon quantifications generated above (one sub-directory per sample in ./quants)
salmonQuant <- importIsoformExpression(
    parentDir = './quants',
    addIsofomIdAsColumn = TRUE
)

# salmonDf <- prepareSalmonFileDataFrame(
#     parentDir = './quants'
# )

R[write to console]: Step 1 of 3: Identifying which algorithm was used...

R[write to console]:     The quantification algorithm used was: Salmon

R[write to console]:     Found 12 quantification file(s) of interest

R[write to console]: Step 2 of 3: Reading data...

R[write to console]: reading in files with read_tsv

R[write to console]: 1 
R[write to console]: 2 
R[write to console]: 3 
R[write to console]: 4 
R[write to console]: 5 
R[write to console]: 6 
R[write to console]: 7 
R[write to console]: 8 
R[write to console]: 9 
R[write to console]: 10 
R[write to console]: 11 
R[write to console]: 12 
R[write to console]: 

R[write to console]: Step 3 of 3: Normalizing FPKM/TxPM values via edgeR...

R[write to console]: Done




In [38]:
ls /rumi/shams/abe/genomes/hg38/gencode.v34/

gencode.v34.annotation.consExons.gtf  gencode.v34.transcripts.fa
gencode.v34.annotation.gtf            gene2name-gencode.v34.csv.gz
gencode.v34.annotation.Introns.gtf    GRCh38.primary_assembly.genome.fa
gencode.v34.basic.annotation.gtf.gz   salmon_index/
gencode.v34.metadata.EntrezGene       star_index/


In [39]:
%%R 
GTFFile='/rumi/shams/abe/genomes/hg38/gencode.v34/gencode.v34.annotation.gtf'
FASTAFile='/rumi/shams/abe/genomes/hg38/gencode.v34/gencode.v34.transcripts.fa'
gtf <- rtracklayer::import(GTFFile)

gene2name <- gtf[gtf$type == "gene"] %>% data.frame %>% column_to_rownames('gene_id') %>% dplyr::select('gene_name')

# Reuse GTFFile instead of repeating the path
txdb <- makeTxDbFromGFF(
    GTFFile,
    organism = 'Homo sapiens')

# tx2gene objects 
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- AnnotationDbi::select(txdb, k, "GENEID", "TXNAME")

R[write to console]: Import genomic features from the file as a GRanges object ... 
R[write to console]: OK

R[write to console]: Prepare the 'metadata' data frame ... 
R[write to console]: OK

R[write to console]: Make the TxDb object ... 
R[write to console]: OK

R[write to console]: 'select()' returned 1:1 mapping between keys and columns



In [40]:
%%R 
tx2gene %>% head 

             TXNAME            GENEID
1 ENST00000456328.2 ENSG00000223972.5
2 ENST00000450305.2 ENSG00000223972.5
3 ENST00000473358.1 ENSG00000243485.5
4 ENST00000469289.1 ENSG00000243485.5
5 ENST00000607096.1 ENSG00000284332.1
6 ENST00000606857.1 ENSG00000268020.3


> Apart from the isoform quantification, the IsoformSwitchAnalyzeR workflow needs two or three additional sets of annotation:
> 1. A design matrix.
> 2. The transcript structure of the isoforms (in genomic coordinates), as well as information about which isoforms originate from the same gene (GTF file). You only have to tell IsoformSwitchAnalyzeR where the file is located on your computer/server. Note that IsoformSwitchAnalyzeR also supports RefSeq GFF files; for all other databases you have to use a GTF file. See "What Transcript Database Should I Use?" for recommendations on databases and more information on how to obtain them.
> 3. It is highly recommended that the transcript nucleotide sequences (the FASTA file used to build the index for the quantification tool) are also supplied to IsoformSwitchAnalyzeR.

In [52]:
%%R 
myDesign <- read.table('samplesheet.txt', sep='\t', header=TRUE) %>% 
    mutate(condition = paste(Group, DOX, sep='-')) %>% 
    # dplyr:: prefix avoids clashes with select() from the annotation packages
    dplyr::select(sampleID = FastQ, condition)

myDesign

   sampleID          condition
1     GFPN1         GFP-no-dox
2     GFPN2         GFP-no-dox
3     GFPP1            GFP-dox
4     GFPP2            GFP-dox
5    SOX9N1        SOX9-no-dox
6    SOX9N2        SOX9-no-dox
7    SOX9P1           SOX9-dox
8    SOX9P2           SOX9-dox
9    M303N1 mutSOX9-303-no-dox
10   M303N2 mutSOX9-303-no-dox
11   M303P1    mutSOX9-303-dox
12   M303P2    mutSOX9-303-dox
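
The condition labels above are simply `Group` and `DOX` joined with a hyphen. The `importRdata()` call later in this notebook reports that some condition names were changed, and the exported results table shows `GFP_dox` and `SOX9_dox` rather than the hyphenated labels. The sketch below illustrates how the labels are built and what an underscore-only variant would look like. It is an illustration only: the sample sheet excerpt is hypothetical (only the `FastQ`, `Group` and `DOX` column names come from the R cell), and the exact renaming rule used by the package is not reproduced here.

```python
import csv
import io

# Hypothetical excerpt of samplesheet.txt (tab-separated).
SHEET = (
    "FastQ\tGroup\tDOX\n"
    "GFPN1\tGFP\tno-dox\n"
    "GFPP1\tGFP\tdox\n"
    "SOX9P1\tSOX9\tdox\n"
    "M303N1\tmutSOX9-303\tno-dox\n"
)


def design_rows(text):
    """Build sampleID/condition pairs the same way the R cell does."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        condition = f"{rec['Group']}-{rec['DOX']}"
        rows.append({
            "sampleID": rec["FastQ"],
            "condition": condition,
            # Assumption: replacing hyphens up front keeps labels stable in R.
            "underscored": condition.replace("-", "_"),
        })
    return rows


for row in design_rows(SHEET):
    print(row["sampleID"], row["condition"], row["underscored"])
```

Whichever form is used, the same strings have to appear in the comparison table, because the comparisons are matched against the `condition` column.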


> ### _comparisonsToMake_
A `data.frame` with two columns indicating which pairwise comparisons the created `switchAnalyzeRlist` should contain. The two columns, called `condition_1` and `condition_2`, indicate which conditions should be compared, and the strings given here must match the strings in the `designMatrix$condition` column. If not supplied, all pairwise (unique, non-directional) comparisons of the conditions in `designMatrix$condition` are created. If only a subset of the supplied data is used in the comparisons, the unused data is automatically removed.

Nilay's emails: 
> so this is the sample, where we expect the magic:

    SOX9P1	SOX9	dox	1
    SOX9P2	SOX9	dox	2
> all other samples can be used as a control, but these are the best controls to use.

    GFP no dox
    GFP dox
    and
    SOX9 no dox

In [51]:
%%R 
myComparisons = data.frame(
    condition_2=rep('SOX9-dox',3),
    condition_1=c('SOX9-no-dox','GFP-no-dox','GFP-dox')
)

myComparisons

  condition_2 condition_1
1    SOX9-dox SOX9-no-dox
2    SOX9-dox  GFP-no-dox
3    SOX9-dox     GFP-dox
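
To make the effect of `comparisonsToMake` concrete, the sketch below counts what the default (all unique, non-directional pairs) would produce for this design and which samples remain once only the three comparisons above are requested. It is a plain-Python illustration of the rule described in the documentation text, not a call into IsoformSwitchAnalyzeR; the helper names are hypothetical and the design is copied from the `myDesign` table.

```python
from itertools import combinations

# sampleID -> condition, copied from the myDesign table
DESIGN = {
    "GFPN1": "GFP-no-dox", "GFPN2": "GFP-no-dox",
    "GFPP1": "GFP-dox", "GFPP2": "GFP-dox",
    "SOX9N1": "SOX9-no-dox", "SOX9N2": "SOX9-no-dox",
    "SOX9P1": "SOX9-dox", "SOX9P2": "SOX9-dox",
    "M303N1": "mutSOX9-303-no-dox", "M303N2": "mutSOX9-303-no-dox",
    "M303P1": "mutSOX9-303-dox", "M303P2": "mutSOX9-303-dox",
}

# (condition_1, condition_2) pairs, as in myComparisons
CHOSEN = [
    ("SOX9-no-dox", "SOX9-dox"),
    ("GFP-no-dox", "SOX9-dox"),
    ("GFP-dox", "SOX9-dox"),
]


def all_pairwise(design):
    """Every unique, non-directional pair of conditions (the default)."""
    return list(combinations(sorted(set(design.values())), 2))


def samples_used(design, comparisons):
    """Samples whose condition appears in at least one requested comparison."""
    wanted = {cond for pair in comparisons for cond in pair}
    unknown = wanted - set(design.values())
    if unknown:
        raise ValueError(f"conditions not in design: {sorted(unknown)}")
    return [sample for sample, cond in design.items() if cond in wanted]


print(len(all_pairwise(DESIGN)))          # number of default comparisons
print(len(samples_used(DESIGN, CHOSEN)))  # samples kept for the chosen ones
```

With six conditions the default would be 15 comparisons; restricting to the three of interest keeps four conditions and eight samples, which matches the `summary(aSwitchList)` output further down.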


In [53]:
%%R 
# Create switchAnalyzeRlist
aSwitchList <- importRdata(
    isoformCountMatrix   = salmonQuant$counts,
    isoformRepExpression = salmonQuant$abundance,
    designMatrix         = myDesign,
    comparisonsToMake    = myComparisons,
    isoformExonAnnoation = GTFFile,
    isoformNtFasta       = FASTAFile,
    showProgress = TRUE
)

R[write to console]: Step 1 of 6: Checking data...

R[write to console]: Please note that some condition names were changed due to names not suited for modeling in R.

R[write to console]: Step 2 of 6: Obtaining annotation...

R[write to console]:     importing GTF (this may take a while)

R[write to console]: Step 3 of 6: Calculating gene expression and isoform fraction...

R[write to console]:      98366 ( 43.5%) isoforms were removed since they were not expressed in any samples.

R[write to console]: Step 4 of 6: Merging gene and isoform expression...





R[write to console]: Step 5 of 6: Making comparisons...





R[write to console]: Step 6 of 6: Making switchAnalyzeRlist object...

R[write to console]: Done



In [54]:
!mkdir -p iso

In [55]:
%%R 
aSwitchListFiltered <- preFilter(
  switchAnalyzeRlist = aSwitchList,
  geneExpressionCutoff = 1,
  isoformExpressionCutoff = 0.5,
  removeSingleIsoformGenes = TRUE
)

aSwitchListAnalyzed <- isoformSwitchTestDRIMSeq(
     switchAnalyzeRlist = aSwitchListFiltered,
     testIntegration='isoform_only',
     alpha = 1,
     dIFcutoff = 0
)

extractSwitchSummary(aSwitchListAnalyzed)
aSwitchListAnalyzed <- analyzeIntronRetention(aSwitchListAnalyzed)
table(aSwitchListAnalyzed$isoformFeatures$IR)
consequencesOfInterest <- c('intron_retention')
aSwitchListAnalyzed <- analyzeSwitchConsequences(
  aSwitchListAnalyzed,
  consequencesToAnalyze = consequencesOfInterest,
  dIFcutoff = 0,
  alpha = 1,
  showProgress=FALSE
)

aSwitchListAnalyzed <- analyzeSwitchConsequences(
  aSwitchListAnalyzed,
  consequencesToAnalyze = consequencesOfInterest,
  dIFcutoff = 0,
  alpha = 1
)

write.table(aSwitchListAnalyzed$isoformFeatures, 'iso/isoform_switch_results.txt', sep='\t', quote=FALSE)

saveRDS(aSwitchListAnalyzed, 'iso/aSwitchListAnalyzed.rds')
saveRDS(aSwitchList, 'iso/aSwitchList.rds')

#savehistory(file = "Iso.R")
# https://jmw86069.github.io/splicejam/index.html


R[write to console]: The filtering removed 83037 ( 65% of ) transcripts. There is now 44719 isoforms left

R[write to console]: Step 1 of 6: Creating DM data object...

R[write to console]: Step 2 of 6: Filtering DM data object...

R[write to console]: Step 3 of 6: Estimating precision paramter (this may take a while)...

R[write to console]: Step 4 of 6: Fitting linear models (this may take a while)...

R[write to console]: Step 5 of 6: Testing pairwise comparison(s)...





R[write to console]: Step 6 of 6: Preparing output...

R[write to console]: Result added switchAnalyzeRlist

R[write to console]: An isoform switch analysis was performed for 30315 gene comparisons (99.8%).

R[write to console]: Done

R[write to console]: Step 1 of 3: Massaging data...

R[write to console]: Step 2 of 3: Analyzing splicing...





R[write to console]: Step 3 of 3: Preparing output...

R[write to console]: Done

R[write to console]: Done

R[write to console]: Step 1 of 4: Extracting genes with isoform switches...

R[write to console]: Step 2 of 4: Analyzing 73565 pairwise isoforms comparisons...

R[write to console]: Step 3 of 4: Massaging isoforms comparisons results...

R[write to console]: Step 4 of 4: Preparing output...

R[write to console]: Identified  genes with containing isoforms switching with functional consequences...

R[write to console]: Step 1 of 4: Extracting genes with isoform switches...

R[write to console]: Step 2 of 4: Analyzing 73565 pairwise isoforms comparisons...





R[write to console]: Step 3 of 4: Massaging isoforms comparisons results...

R[write to console]: Step 4 of 4: Preparing output...

R[write to console]: Identified  genes with containing isoforms switching with functional consequences...



In [56]:
!date

Thu Nov  4 22:57:06 UTC 2021


In [58]:
%%R 
summary(aSwitchList)

This switchAnalyzeRlist list contains:
 127756 isoforms from 29904 genes
 3 comparison from 4 conditions (in total 8 samples)

Feature analyzed:
[1] "ORFs, ntSequence"


In [63]:
%%R 
data("exampleSwitchList")

# exampleSwitchList <- subsetSwitchAnalyzeRlist(
#     switchAnalyzeRlist = exampleSwitchList,
#     subset = abs(exampleSwitchList$isoformFeatures$dIF) > 0.4
# )

# extractSwitchSummary( exampleSwitchList )
exampleSwitchList 

This switchAnalyzeRlist list contains:
 259 isoforms from 84 genes
 1 comparison from 2 conditions (in total 6 samples)

Feature analyzed:
[1] "ORFs, ntSequence"


In [57]:
cat iso/isoform_switch_results.txt | head 

iso_ref	gene_ref	isoform_id	gene_id	condition_1	condition_2	gene_name	gene_biotype	iso_biotype	gene_overall_mean	gene_value_1	gene_value_2	gene_stderr_1	gene_stderr_2	gene_log2_fold_change	gene_q_value	iso_overall_mean	iso_value_1	iso_value_2	iso_stderr_1	iso_stderr_2	iso_log2_fold_change	iso_q_value	IF_overall	IF1	IF2	dIF	isoform_switch_q_value	gene_switch_q_value	PTC	IR	switchConsequencesGene
1	isoComp_00000001	geneComp_00000001	ENST00000373020.9	ENSG00000000003.15	GFP_dox	SOX9_dox	TSPAN6	protein_coding	protein_coding	23.84840538064	21.9864499735924	25.3385784511022	0.137502077308534	0.800108684363273	0.204634138408735	NA	19.1640260808249	18.8135356739888	21.5138005432493	0.265264530007853	0.208211449010186	0.193395204094833	NA	0.806125	0.85565	0.84965	-0.00600000000000001	1	0.683275245957705	FALSE	NA	FALSE
2	isoComp_00000002	geneComp_00000001	ENST00000496771.5	ENSG00000000003.15	GFP_dox	SOX9_dox	TSPAN6	protein_coding	processed_transcript	23.84840538064	21.9864499735924	25.33857845
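
For reading this table: `IF1` and `IF2` are the isoform fractions (isoform expression relative to its parent gene) in `condition_1` and `condition_2`, and `dIF` is their difference, `IF2 - IF1`. The sketch below re-derives `dIF` for the first row as a sanity check. It is a minimal illustration using only a few columns copied from that row; the `check_dif` helper is hypothetical and not part of the original analysis.

```python
import csv
import io

# Selected columns from the first data row of iso/isoform_switch_results.txt
EXCERPT = (
    "isoform_id\tgene_name\tcondition_1\tcondition_2\tIF1\tIF2\tdIF\n"
    "ENST00000373020.9\tTSPAN6\tGFP_dox\tSOX9_dox\t0.85565\t0.84965\t-0.00600000000000001\n"
)


def check_dif(text, tol=1e-9):
    """Recompute dIF = IF2 - IF1 and compare it with the reported value."""
    results = []
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        recomputed = float(rec["IF2"]) - float(rec["IF1"])
        matches = abs(recomputed - float(rec["dIF"])) < tol
        results.append((rec["isoform_id"], recomputed, matches))
    return results


for isoform_id, dif, matches in check_dif(EXCERPT):
    print(isoform_id, round(dif, 5), matches)
```

A negative `dIF` means the isoform makes up a smaller share of its gene's expression in `condition_2` (here `SOX9_dox`) than in `condition_1`. Because the test above was run with `alpha = 1` and `dIFcutoff = 0`, the table is unfiltered, so any thresholds on `dIF` and `isoform_switch_q_value` still have to be applied downstream.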

### `bed` files 

Description of the files:
> - The two GFP dox-positive controls are GFP_1_DOX_pos and GFP_2_DOX_pos.
> - The SOX9 dox-positive samples are SOX9_1_DOX_pos and SOX9_2_DOX_pos.
> - The SOX9 dox-negative samples are sox9_1_DOX_neg and sox9_2_DOX_neg.
> - There is also the truncated version of SOX9, S303. The S303 dox-positive samples are s303_1_DOX_pos and s303_2_DOX_pos, and the S303 dox-negative samples are s303_1_DOX_neg and s303_2_DOX_neg.

> We use all four (GFP_dox, SOX9_no_dox, S303_dox, and S303_no_dox) as controls against SOX9_dox.

In [64]:
!ls bed_files/

GFP_1_DOX_pos_IDX1.rep1_sorted_peaks.narrowPeak.bed
GFP_2_DOX_pos_IDX2.rep1_sorted_peaks.narrowPeak.bed
s303_1_DOX_neg_IDX13.rep1_sorted_peaks.narrowPeak.bed
s303_1_DOX_pos_IDX7.rep1_sorted_peaks.narrowPeak.bed
s303_2_DOX_neg_IDX14.rep1_sorted_peaks.narrowPeak.bed
s303_2_DOX_pos_IDX8.rep1_sorted_peaks.narrowPeak.bed
sox9_1_DOX_neg_IDX11.rep1_sorted_peaks.narrowPeak.bed
SOX9_1_DOX_pos_IDX3.rep1_sorted_peaks.narrowPeak.bed
sox9_2_DOX_neg_IDX12.rep1_sorted_peaks.narrowPeak.bed
SOX9_2_DOX_pos_IDX4.rep1_sorted_peaks.narrowPeak.bed
