Commits on Nov 03, 2014
- extended and reformatted arg-parser to give users better information
  on the available options
- added more possibilities to add read group information to the BAM
  header (@RG line) based on direct input or a TCGA XML file.
- replaced downstream usage of samtools with alternatives that are part
  of STAR (sorting, multiple input FASTQ files)
- allowed alignment of single end read sets
- added additional decompression setting for bzip2 files
- added bzip2
- added pip to install more python packages
- added lxml package
- corrected install of STAR from previous commit
Commits on Nov 11, 2014
- read group information in the header now supports multiple input files
- improved data management: data can be written to a tmp directory
- add junctions as output file for two pass run
- clean up data after STAR run
- cleaned code from old comment-lines
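
The temporary-directory handling and cleanup mentioned in these commits could look roughly like the sketch below. It is not the wrapper's actual code: the helper name `scratch_dir`, the `star_align_` prefix and the fallback to the work directory are assumptions made for illustration. Only the idea of naming an environment variable whose value is used as the prefix for temporary data comes from the `--useTMP` option.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_dir(work_dir, use_tmp=None):
    """Yield a scratch directory and delete it when the block exits.

    use_tmp is the NAME of an environment variable whose value is used as
    the parent directory (mirroring --useTMP); if it is not given or not
    set, the scratch directory is created under work_dir instead.
    """
    parent = os.environ.get(use_tmp) if use_tmp else None
    if not parent:
        parent = work_dir
    path = tempfile.mkdtemp(prefix="star_align_", dir=parent)
    try:
        yield path
    finally:
        # Remove intermediate data even if the alignment step raised.
        shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    with scratch_dir(".", use_tmp="TMPDIR") as tmp:
        print("intermediate files would go to", tmp)
```

Using a context manager keeps the cleanup in one place, so a failed STAR run does not leave partial data behind.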
Commits on Nov 12, 2014
Commits on Dec 04, 2014
- add comment line to header
- add legacy sample ID
- fix wrong info in some of the fields
- minor code cleanup
- adapted STAR version in xml template
- changed perl script to dynamically fill in run command into xml
Commits on Dec 09, 2014
Update to first version of STAR alignment SOP
Commits on Dec 28, 2014
- replaced STAR 2-pass alignment with an equivalent workaround until
  2-pass alignment that includes annotated junctions is available
- added option to allow for a weak RG check that generates
  a generic read group label in case one input file is given but the
  metadata contains several read groups
- minor changes to parameter default settings
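
A minimal sketch of what the weak RG check might do, assuming read group records are held as dictionaries. The function name `resolve_read_groups`, the placeholder ID `generic_rg` and the decision to keep the sample name only when it is unambiguous are all assumptions; the diff does not show the wrapper's real logic.

```python
def resolve_read_groups(input_files, rg_records, weak_check=False):
    """Return one read-group record (a dict) per input file.

    rg_records is a list of dicts taken from the metadata,
    e.g. {"ID": "rg1", "SM": "sample"}.
    """
    if len(rg_records) == len(input_files):
        return [dict(rg) for rg in rg_records]

    if weak_check and len(input_files) == 1 and len(rg_records) > 1:
        # Weak check: collapse several metadata records into one generic
        # record. "generic_rg" is a placeholder ID chosen for this sketch.
        generic = {"ID": "generic_rg"}
        samples = {rg.get("SM") for rg in rg_records}
        if len(samples) == 1 and None not in samples:
            generic["SM"] = samples.pop()  # keep the sample only if unambiguous
        return [generic]

    raise ValueError(
        "%d input file(s) but %d read group record(s) in the metadata"
        % (len(input_files), len(rg_records))
    )
```

Without `weak_check`, a mismatch between files and records is an error, which matches the "Use with caution!" warning attached to `--weakRGcheck`.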
Commits on Dec 29, 2014
- added two new command line options to specify a metadata
  spreadsheet file and an analysis id to align instead of an xml
- added sanity checks for command line parameters
- fixed some comments in the code
Commits on Jan 14, 2015
Adapted parameter name metaData to metaDataTab as suggested by kellrott.
STAR alignment wrapper containing 2 pass alignment and use of metadata sheet
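
For orientation, one common way to emulate two-pass alignment by hand is to align once, rebuild the genome index with the splice junctions found in that run, and then align again against the new index. The sketch below only assembles the three command lines and does not run them. Whether the wrapper does exactly this is not visible in this diff; the directory names and the helper name `two_pass_commands` are hypothetical, and handling of annotated junctions is left out.

```python
import os


def two_pass_commands(genome_dir, fasta, reads, work_dir, threads=4, overhang=100):
    """Build argv lists for: pass 1, index rebuild with junctions, pass 2."""
    pass1_prefix = os.path.join(work_dir, "pass1") + os.sep
    pass2_prefix = os.path.join(work_dir, "pass2") + os.sep
    index2_dir = os.path.join(work_dir, "index_pass2")
    # STAR writes detected junctions to <prefix>SJ.out.tab
    junctions = pass1_prefix + "SJ.out.tab"

    pass1 = ["STAR", "--genomeDir", genome_dir,
             "--readFilesIn", *reads,
             "--runThreadN", str(threads),
             "--outFileNamePrefix", pass1_prefix]
    reindex = ["STAR", "--runMode", "genomeGenerate",
               "--genomeDir", index2_dir,
               "--genomeFastaFiles", fasta,
               "--sjdbFileChrStartEnd", junctions,
               "--sjdbOverhang", str(overhang),
               "--runThreadN", str(threads)]
    pass2 = ["STAR", "--genomeDir", index2_dir,
             "--readFilesIn", *reads,
             "--runThreadN", str(threads),
             "--outFileNamePrefix", pass2_prefix]
    return [pass1, reindex, pass2]


if __name__ == "__main__":
    for cmd in two_pass_commands("index", "genome.fa", ["left.fastq", "right.fastq"], "work"):
        print(" ".join(cmd))
```

The rebuild step is why the wrapper needs `--genomeFastaFiles` and `--sjdbOverhang` even though an index is already supplied via `--genomeDir`.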
Commits on Feb 09, 2015
- bumped STAR version to 2.4.0i in the xml (used for alignments)
- changed study name to "PCAWG 2.0"
- used analysis ID and aligner name for filename
- augmented input file by one column to reflect aligner information
- removed read group parsing from the XML; filenames are now used to label
  read groups
- added memory-limit for BAM sorting
Commits on Feb 14, 2015
- adapted parsing of metadata table to use current header convention
- use fastq_files from the metadata table for active whitelisting: only
  files that appear in the table will be used for alignment
- re-organized code a bit to reflect the above changes
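
The whitelisting step can be pictured as follows. This is a sketch under stated assumptions: the table is taken to be tab-separated with a header row, the ID column name `analysis_id` and the comma separator inside `fastq_files` are guesses, and the helper name `whitelist_fastq` is hypothetical. Only the column name `fastq_files` appears in the commit message.

```python
import csv
import io
import os


def whitelist_fastq(table, analysis_id, candidates,
                    id_column="analysis_id", files_column="fastq_files", sep=","):
    """Keep only the candidate files listed for analysis_id in the table.

    table is an open text file (or any iterable of lines) holding a
    tab-separated metadata sheet with a header row.
    """
    allowed = set()
    for row in csv.DictReader(table, delimiter="\t"):
        if row.get(id_column) == analysis_id:
            names = (row.get(files_column) or "").split(sep)
            allowed.update(n.strip() for n in names if n.strip())
    # Compare on the basename so unpacked paths still match table entries.
    return [path for path in candidates if os.path.basename(path) in allowed]


if __name__ == "__main__":
    sheet = io.StringIO(
        "analysis_id\tfastq_files\n"
        "A1\tx_1.fastq,x_2.fastq\n"
        "A2\ty_1.fastq\n"
    )
    found = ["unpacked/x_1.fastq", "unpacked/x_2.fastq", "unpacked/extra.fastq"]
    print(whitelist_fastq(sheet, "A1", found))
```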

This file was deleted.

@@ -0,0 +1,84 @@
usage: star_align.py [options]

ICGC RNA-Seq alignment wrapper for STAR alignments.

Required input parameters:
```
--genomeDir GENOMEDIR
Directory containing the reference genome index
(default: None)
--tarFileIn TARFILEIN
Input file containing the sequence information
(default: None)
optional input parameters:
--out OUT Name of the output BAM file (default: out.bam)
--workDir WORKDIR Work directory (default: ./)
--metaDataTab METADATATAB
File containing metadata for the alignment header
(default: None)
--analysisID ANALYSISID
Analysis ID to be considered in the metadata file
(default: None)
--keepJunctions keeps the junction file as {--out}.junctions (default:
False)
--useTMP USETMP environment variable that is used as prefix for
temporary data (default: None)
--weakRGcheck only perform weak RG record check and generate generic
RG ID in case of a single alignment file with multiple
RG records present. Use with caution! (default: False)
-h, --help show this help message and exit (default: False)
STAR input parameters:
--runThreadN RUNTHREADN
Number of threads (default: 4)
--outFilterMultimapScoreRange OUTFILTERMULTIMAPSCORERANGE
outFilterMultimapScoreRange (default: 1)
--outFilterMultimapNmax OUTFILTERMULTIMAPNMAX
outFilterMultimapNmax (default: 20)
--outFilterMismatchNmax OUTFILTERMISMATCHNMAX
outFilterMismatchNmax (default: 10)
--alignIntronMax ALIGNINTRONMAX
alignIntronMax (default: 500000)
--alignMatesGapMax ALIGNMATESGAPMAX
alignMatesGapMax (default: 1000000)
--sjdbScore SJDBSCORE
sjdbScore (default: 2)
--alignSJDBoverhangMin ALIGNSJDBOVERHANGMIN
alignSJDBoverhangMin (default: 1)
--genomeLoad GENOMELOAD
genomeLoad (default: NoSharedMemory)
--genomeFastaFiles GENOMEFASTAFILES
genome sequence in fasta format to rebuild index
(default: None)
--outFilterMatchNminOverLread OUTFILTERMATCHNMINOVERLREAD
outFilterMatchNminOverLread (default: 0.33)
--outFilterScoreMinOverLread OUTFILTERSCOREMINOVERLREAD
outFilterScoreMinOverLread (default: 0.33)
--twopass1readsN TWOPASS1READSN
twopass1readsN (-1 means all reads are used for
remapping) (default: -1)
--sjdbOverhang SJDBOVERHANG
sjdbOverhang (only necessary for two-pass mode)
(default: 100)
--outSAMstrandField OUTSAMSTRANDFIELD
outSAMstrandField (default: intronMotif)
--outSAMattributes OUTSAMATTRIBUTES
outSAMattributes (default: ['NH', 'HI', 'NM', 'MD',
'AS', 'XS'])
--outSAMunmapped OUTSAMUNMAPPED
outSAMunmapped (default: Within)
--outSAMtype OUTSAMTYPE
outSAMtype (default: ['BAM', 'SortedByCoordinate'])
--outSAMheaderHD OUTSAMHEADERHD
outSAMheaderHD (default: ['@HD', 'VN:1.4'])
--outSAMattrRGline OUTSAMATTRRGLINE
RG attribute line submitted to outSAMattrRGline
(default: None)
--outSAMattrRGfile OUTSAMATTRRGFILE
File containing the RG attribute line submitted to
outSAMattrRGline (default: None)
--outSAMattrRGxml OUTSAMATTRRGXML
XML-File in TCGA format to compile RG attribute line
(default: None)
```
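
Help output of this shape (named option groups, with defaults appended to each description) is what Python's `argparse` produces with argument groups and `ArgumentDefaultsHelpFormatter`. The sketch below reproduces a handful of the options listed above to show the pattern; it is an illustration, not the wrapper's actual parser, and the choice of which options to mark as `required=True` is an assumption.

```python
import argparse


def build_parser():
    """Build a small parser with grouped options and defaults shown in --help."""
    parser = argparse.ArgumentParser(
        prog="star_align.py",
        description="ICGC RNA-Seq alignment wrapper for STAR alignments.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    required = parser.add_argument_group("Required input parameters")
    required.add_argument("--genomeDir", required=True,
                          help="Directory containing the reference genome index")
    required.add_argument("--tarFileIn", required=True,
                          help="Input file containing the sequence information")

    optional = parser.add_argument_group("optional input parameters")
    optional.add_argument("--out", default="out.bam",
                          help="Name of the output BAM file")
    optional.add_argument("--workDir", default="./", help="Work directory")
    optional.add_argument("--keepJunctions", action="store_true",
                          help="keeps the junction file as {--out}.junctions")

    star = parser.add_argument_group("STAR input parameters")
    star.add_argument("--runThreadN", type=int, default=4,
                      help="Number of threads")
    star.add_argument("--outSAMtype", nargs="+",
                      default=["BAM", "SortedByCoordinate"], help="outSAMtype")
    return parser


if __name__ == "__main__":
    build_parser().print_help()
```

List-valued options such as `--outSAMtype` use `nargs="+"`, which is why their defaults are printed as Python lists in the help text.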
@@ -2,7 +2,7 @@
<ANALYSIS_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/sra/doc/SRA_1-5/SRA.analysis.xsd?view=co">
<ANALYSIS alias="PCAWG.94f8d946-a92e-4d2b-9f29-8e10f4274efe.bam" analysis_date="2014-08-02T21:48:26.320095" center_name="MSKCC">
<TITLE>TCGA/ICGC PanCancer RNA-Seq realignment</TITLE>
<STUDY_REF refcenter="UCSC" refname="PCAWG_TEST">
<STUDY_REF refcenter="UCSC" refname="PCAWG 2.0">
</STUDY_REF>
<DESCRIPTION>STAR realignment of TCGA RNA-Seq data from matching participants from the PanCan Analysis of Whole Genomes (PCAWG) study</DESCRIPTION>
<ANALYSIS_TYPE>
@@ -104,16 +104,9 @@
<STEP_INDEX>mapping</STEP_INDEX>
<PREV_STEP_INDEX>NIL</PREV_STEP_INDEX>
<PROGRAM>STAR</PROGRAM>
<VERSION>2.4.0b</VERSION>
<VERSION>2.4.0i</VERSION>
<NOTES>STAR --genomeDir hg19_GRCh37.p13_STAR_gencode19.overhang100 --readFilesIn left.fastq right.fastq --runThreadN runThreadN --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2 --alignSJDBoverhangMin 1 --genomeLoad NoSharedMemory --readFilesCommand bzcat --outFilterMatchNminOverLread 0.33 --outFilterScoreMinOverLread 0.33 --outSAMstrandField intronMotif --outSAMattributes NH HI NM MD AS XS --outSAMunmapped Within --outSAMtype BAM Unsorted</NOTES>
</PIPE_SECTION>
<PIPE_SECTION section_name="bam_sort">
<STEP_INDEX>bam_sort</STEP_INDEX>
<PREV_STEP_INDEX>mapping</PREV_STEP_INDEX>
<PROGRAM>samtools sort</PROGRAM>
<VERSION>0.1.19</VERSION>
<NOTES>samtools sort - "final.bam"</NOTES>
</PIPE_SECTION>
</PIPELINE>
<DIRECTIVES>
<alignment_includes_unaligned_reads>true</alignment_includes_unaligned_reads>
@@ -131,27 +124,27 @@
<ANALYSIS_ATTRIBUTES>
<ANALYSIS_ATTRIBUTE>
<TAG>STUDY</TAG>
<VALUE>PCAWG_TEST</VALUE>
<VALUE>PCAWG 2.0</VALUE>
</ANALYSIS_ATTRIBUTE>
<ANALYSIS_ATTRIBUTE>
<TAG>workflow_name</TAG>
<VALUE>Workflow_Bundle_STAR</VALUE>
<VALUE>RNA-Seq_Alignment_SOP_STAR</VALUE>
</ANALYSIS_ATTRIBUTE>
<ANALYSIS_ATTRIBUTE>
<TAG>workflow_version</TAG>
<VALUE>0.1</VALUE>
<VALUE>v1</VALUE>
</ANALYSIS_ATTRIBUTE>
<ANALYSIS_ATTRIBUTE>
<TAG>workflow_source_url</TAG>
<VALUE>https://github.com/kellrott/icgc_rnaseq_align</VALUE>
<VALUE>https://github.com/ucscCancer/icgc_rnaseq_align</VALUE>
</ANALYSIS_ATTRIBUTE>
<ANALYSIS_ATTRIBUTE>
<TAG>workflow_bundle_url</TAG>
<VALUE>NA</VALUE>
</ANALYSIS_ATTRIBUTE>
<ANALYSIS_ATTRIBUTE>
<TAG>STAR_version</TAG>
<VALUE>2.4.0b</VALUE>
<VALUE>2.4.0i</VALUE>
</ANALYSIS_ATTRIBUTE>
</ANALYSIS_ATTRIBUTES>
</ANALYSIS>
@@ -37,13 +37,13 @@ ()
while(my $line = <IN>)
{
chomp($line);
my ($original_analysis_id,$new_filepath,$new_md5) = split(/\t/,$line);
my ($original_analysis_id,$new_filepath,$new_md5,$run_cmd,$aligner) = split(/\t/,$line);
#dump original metadata
run_command("dump_all_metadata.py $original_analysis_id",$STDOUT_FILE,$STDERR_FILE);
#extract original metadata still relevant to PCAWG metadata
my $md_lines = extract_old_metadata_elements($original_analysis_id);
#create new metadata package from old metadata bits and template
my $new_analysis_id = synthesize_new_analysis($md_lines,$new_filepath,$new_md5,$template_analysis_xml,$original_analysis_id);
my $new_analysis_id = synthesize_new_analysis($md_lines,$new_filepath,$new_md5,$run_cmd,$template_analysis_xml,$original_analysis_id,$aligner);
#do the actual validation->submission->upload
if(validate_new_metadata($new_analysis_id) && $submit_key)
{
@@ -94,10 +94,10 @@ ()

sub synthesize_new_analysis()
{
my ($md_lines,$filepath,$md5,$templateF,$original_analysis_id) = @_;
my ($md_lines,$filepath,$md5,$run_cmd,$templateF,$original_analysis_id,$aligner) = @_;
my @run_cmds = split(/\$/,$run_cmd);

my @f=split(/\//,$filepath);
my $filename = pop(@f);
my $filename = "PCAWG.$original_analysis_id.$aligner.v1.bam";

my $new_analysis_id = run_command('uuidgen');
chomp($new_analysis_id);
@@ -107,7 +107,7 @@ ()
#copy the original metadata into the new directory
run_command("rsync -av $original_analysis_id/*.xml $new_analysis_id/");
#link in the new realigned file into the new directory
run_command("ln -s $filepath $new_analysis_id/PCAWG.$filename");
run_command("ln -s $filepath $new_analysis_id/$filename");

open(TEMPLATE,"<$templateF");
open(OUT,">$new_analysis_id/analysis.xml");
@@ -121,9 +121,11 @@ ()
if($line =~ /<FILE/)
{
$line =~ s/checksum="[^"]*"/checksum="$md5"/;
$line =~ s/filename="[^"]*"/filename="PCAWG.$filename"/;
$line =~ s/filename="[^"]*"/filename="$filename"/;
$line =~ s/filetype="[^"]*"/filetype="bam"/;
}
my $tmp_cmd = join("\n", @run_cmds);
$line =~ s/<NOTES>.*<\/NOTES>/<NOTES>$tmp_cmd<\/NOTES>/;
if($line =~ /$FAILED_READS/)
{
my $cur_val = $1;
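
The hunk above splits the run command on `$`, joins the pieces with newlines and writes the result into the template's `<NOTES>` element. The same step is sketched here in Python for readers following the wrapper side. The function name `fill_notes` is hypothetical, and the XML escaping is an addition that the Perl code shown does not perform.

```python
import re
from xml.sax.saxutils import escape

NOTES_RE = re.compile(r"<NOTES>.*</NOTES>")


def fill_notes(line, run_cmd, sep="$"):
    """Replace the content of a <NOTES> element on one template line.

    run_cmd may hold several commands separated by sep; they are written
    one per line, as in the Perl script.
    """
    commands = "\n".join(run_cmd.split(sep))
    replacement = "<NOTES>" + escape(commands) + "</NOTES>"
    # A function replacement keeps backslashes in the command text literal.
    return NOTES_RE.sub(lambda match: replacement, line)


if __name__ == "__main__":
    template_line = "<NOTES>placeholder</NOTES>"
    print(fill_notes(template_line, "STAR --runThreadN 4$STAR --runThreadN 8"))
```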
@@ -2,7 +2,7 @@ FROM ubuntu


RUN apt-get update
RUN apt-get install -y curl wget g++ make python libboost-dev libboost-thread-dev libboost-system-dev zlib1g-dev ncurses-dev unzip
RUN apt-get install -y curl wget g++ make python libboost-dev libboost-thread-dev libboost-system-dev zlib1g-dev ncurses-dev unzip gzip bzip2 libxml2-dev libxslt-dev python-dev python-pip
WORKDIR /opt

RUN mkdir /opt/bin
@@ -27,8 +27,12 @@ RUN tar xvzf tophat-2.0.12.tar.gz
RUN cd /opt/tophat-2.0.12 && ./configure --prefix=/opt --with-boost-libdir=/usr/lib/x86_64-linux-gnu/ && make && make install

#install STAR
RUN wget https://github.com/alexdobin/STAR/archive/STAR_2.4.0d.tar.gz
RUN tar xvzf STAR_2.4.0d.tar.gz
RUN cd /opt/STAR-STAR_2.4.0d && make && cp /opt/STAR-STAR_2.4.0d/STAR /opt/bin/
RUN wget https://github.com/alexdobin/STAR/archive/STAR_2.4.0i.tar.gz
RUN tar xvzf STAR_2.4.0i.tar.gz
RUN cp /opt/STAR-STAR_2.4.0i/bin/Linux_x86_64/STAR /opt/bin/

#install python packages
RUN pip install lxml


ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin
