Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use multiple genomes and annotation for transcriptome mapping #46

Merged
merged 64 commits into from
Dec 11, 2019
Merged
Show file tree
Hide file tree
Changes from 59 commits
Commits
Show all changes
64 commits
Select commit Hold shift + click to select a range
09993cf
Add transcriptome
drpatelh Dec 6, 2019
03ffe94
Add igenomes logic to main script
drpatelh Dec 6, 2019
a15d60c
Fix channels
drpatelh Dec 6, 2019
640a218
Raise log warnings
drpatelh Dec 6, 2019
2bfa17c
Update log to error
drpatelh Dec 6, 2019
a3c6075
Update fastq checks
drpatelh Dec 6, 2019
aeb85d8
Exit if fastq extension wrong
drpatelh Dec 6, 2019
8c44e14
Update fastq logs
drpatelh Dec 6, 2019
2b3d0d2
Tidy up code
drpatelh Dec 6, 2019
2ceddfd
Fix remnants
drpatelh Dec 6, 2019
dd7f8b2
Add in basecalling
drpatelh Dec 6, 2019
88c05cb
Fix config bug
drpatelh Dec 6, 2019
7dd5370
Fix channels
drpatelh Dec 6, 2019
7575d34
Add in FastQC
drpatelh Dec 6, 2019
ca2174b
Run through FastQC
drpatelh Dec 6, 2019
af2b2c9
Refactor genome - transcriptome
drpatelh Dec 6, 2019
f3af78a
Use transcriptome too
drpatelh Dec 6, 2019
4a6388c
Update logic
drpatelh Dec 6, 2019
fcbc146
Update logic
drpatelh Dec 6, 2019
ed759f6
Add in Guppy againnnnn
drpatelh Dec 6, 2019
2cca379
Add in FastQC
drpatelh Dec 6, 2019
a604bf1
Update gtf logic
drpatelh Dec 6, 2019
23526ba
Add function to check
drpatelh Dec 6, 2019
5a843c2
Output everything
drpatelh Dec 6, 2019
c391b2b
Update logic for gtf and fasta
drpatelh Dec 6, 2019
0f6390b
Add header
drpatelh Dec 6, 2019
1ad00d1
Update outputs
drpatelh Dec 6, 2019
89031c1
Add in FastQC
drpatelh Dec 6, 2019
0ac48a8
Add in function again
drpatelh Dec 6, 2019
4785b5a
Fix channels
drpatelh Dec 6, 2019
3eae1ad
Update main.nf
drpatelh Dec 6, 2019
4449030
Update samplesheet
drpatelh Dec 8, 2019
84346c3
Fix logic
drpatelh Dec 8, 2019
7faee80
Refactor if statements
drpatelh Dec 8, 2019
25b2d3e
Remove dup code
drpatelh Dec 8, 2019
753aad8
Put channel edir in process
drpatelh Dec 8, 2019
5e097f7
Add chrom sizes
drpatelh Dec 8, 2019
8014fcf
Add in all processes
drpatelh Dec 8, 2019
f08a55b
Add gtf2bed
drpatelh Dec 8, 2019
33295b9
Add gtf2bed
drpatelh Dec 8, 2019
ed8a363
Add gtf2bed
drpatelh Dec 8, 2019
36ec990
Add gtf2bed
drpatelh Dec 8, 2019
37d3ea0
Comment out code for testing
drpatelh Dec 8, 2019
0e423a7
Fix conflicts from @lwratten split
drpatelh Dec 9, 2019
6b0649b
Change shebang
drpatelh Dec 9, 2019
fcbc68b
Change process name
drpatelh Dec 9, 2019
32f4707
Change process name
drpatelh Dec 9, 2019
b1fdf7e
Fix channels
drpatelh Dec 9, 2019
12a30c0
Update minimap2 index
drpatelh Dec 9, 2019
dab652a
Add graphmap2 index
drpatelh Dec 9, 2019
f6544bf
Fix file staging
drpatelh Dec 9, 2019
29b3367
Complete refactor again
drpatelh Dec 9, 2019
9d54070
Get channels working
drpatelh Dec 9, 2019
53480ad
Add in graphmap
drpatelh Dec 9, 2019
221f4d0
Add in bedgraph
drpatelh Dec 9, 2019
b6dad66
Add in BigWig
drpatelh Dec 9, 2019
a1ff4c7
Change file extensions
drpatelh Dec 9, 2019
4f8d1b0
Change channel content
drpatelh Dec 9, 2019
debaa38
Add in MultiQC
drpatelh Dec 9, 2019
c2b7acc
Tidy up
drpatelh Dec 9, 2019
eadc3de
Add @lwrattens suggestions
drpatelh Dec 10, 2019
50b2e77
Fix indents
drpatelh Dec 10, 2019
b8fc84f
Fix and and bug
drpatelh Dec 10, 2019
b333ca6
Fix channel not being defined
drpatelh Dec 10, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
42 changes: 32 additions & 10 deletions bin/check_samplesheet.py
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@
############################################

ERROR_STR = 'ERROR: Please check samplesheet'
HEADER = ['sample', 'fastq', 'barcode', 'genome']
HEADER = ['sample', 'fastq', 'barcode', 'genome', 'transcriptome']

## CHECK HEADER
fin = open(args.DESIGN_FILE_IN,'r')
Expand All @@ -50,49 +50,71 @@
line = fin.readline()
if line:
lspl = [x.strip() for x in line.strip().split(',')]
sample,fastq,barcode,genome = lspl
sample,fastq,barcode,genome,transcriptome = lspl

## CHECK VALID NUMBER OF COLUMNS PER SAMPLE
numCols = len([x for x in lspl if x])
if numCols < 2:
print("{}: Invalid number of columns (minimum of 2)!\nLine: '{}'".format(ERROR_STR,line.strip()))
sys.exit(1)

## CHECK SAMPLE ID ENTRIES
if sample:
## CHECK SAMPLE ID HAS NO SPACES
if sample.find(' ') != -1:
print("{}: Sample ID contains spaces!\nLine: '{}'".format(ERROR_STR,line.strip()))
sys.exit(1)
else:
print("{}: Sample ID not specified!\nLine: '{}'".format(ERROR_STR,line.strip()))
sys.exit(1)

## CHECK BARCODE ENTRIES
if barcode:
## CHECK BARCODE COLUMN IS INTEGER
if not barcode.isdigit():
print("{}: Barcode not an integer!\nLine: '{}'".format(ERROR_STR,line.strip()))
sys.exit(1)
else:
barcode = 'barcode%s' % (barcode.zfill(2))

## CHECK FASTQ ENTRIES
if fastq:
## CHECK FASTQ FILE EXTENSION
if fastq[-9:] != '.fastq.gz' and fastq[-6:] != '.fq.gz':
print("{}: FastQ file has incorrect extension (has to be '.fastq.gz' or '.fq.gz')!\nLine: '{}'".format(ERROR_STR,line.strip()))
sys.exit(1)

## CHECK GENOME ENTRIES
if genome:
## CHECK GENOME HAS NO SPACES
if genome.find(' ') != -1:
print("{}: Genome field contains spaces!\nLine: '{}'".format(ERROR_STR,line.strip()))
sys.exit(1)

## CHECK GENOME EXTENSION
if len(genome.split('.')) > 1:
if genome[-6:] != '.fasta' and genome[-3:] != '.fa' and genome[-9:] != '.fasta.gz' and genome[-6:] != '.fa.gz':
print("{}: Genome field incorrect extension (has to be '.fasta' or '.fa' or '.fasta.gz' or '.fa.gz')!\nLine: '{}'".format(ERROR_STR,line.strip()))
print("{}: Genome field incorrect extension (has to be '.fasta', '.fa', '.fasta.gz' or '.fa.gz')!\nLine: '{}'".format(ERROR_STR,line.strip()))
sys.exit(1)

outLines.append([sample,fastq,barcode,genome])
## CHECK TRANSCRIPTOME ENTRIES
gtf = ''
is_transcripts = '0'
if transcriptome:

if transcriptome.find(' ') != -1:
print("{}: Transcriptome field contains spaces!\nLine: '{}'".format(ERROR_STR,line.strip()))
sys.exit(1)

if transcriptome[-6:] != '.fasta' and transcriptome[-3:] != '.fa' and transcriptome[-9:] != '.fasta.gz' and transcriptome[-6:] != '.fa.gz' and transcriptome[-4:] != '.gtf':
drpatelh marked this conversation as resolved.
Show resolved Hide resolved
print("{}: Transcriptome field incorrect extension (has to be '.fasta', '.fa', '.fasta.gz', '.fa.gz' or '.gtf')!\nLine: '{}'".format(ERROR_STR,line.strip()))
drpatelh marked this conversation as resolved.
Show resolved Hide resolved
sys.exit(1)

if transcriptome[-4:] == '.gtf':
drpatelh marked this conversation as resolved.
Show resolved Hide resolved
gtf = transcriptome
if not genome:
print("{}: If genome isnt provided, transcriptome must be in fasta format for mapping!\nLine: '{}'".format(ERROR_STR,line.strip()))
drpatelh marked this conversation as resolved.
Show resolved Hide resolved
sys.exit(1)
else:
lwratten marked this conversation as resolved.
Show resolved Hide resolved
is_transcripts = '1'
genome = transcriptome

outLines.append([sample,fastq,barcode,genome,gtf,is_transcripts])
else:
fin.close()
break
Expand All @@ -106,7 +128,7 @@

## WRITE TO FILE
fout = open(args.DESIGN_FILE_OUT,'w')
fout.write(','.join(HEADER) + '\n')
fout.write(','.join(['sample', 'fastq', 'barcode', 'genome', 'gtf', 'is_transcripts']) + '\n')
for line in outLines:
fout.write(','.join(line) + '\n')
fout.close()
123 changes: 123 additions & 0 deletions bin/gtf2bed
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
#!/usr/bin/env perl

# Copyright (c) 2011 Erik Aronesty (erik@q32.com)
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
#
# ALSO, IT WOULD BE NICE IF YOU LET ME KNOW YOU USED IT.

use Getopt::Long;

my $extended;
GetOptions("x"=>\$extended);

$in = shift @ARGV;

my $in_cmd =($in =~ /\.gz$/ ? "gunzip -c $in|" : $in =~ /\.zip$/ ? "unzip -p $in|" : "$in") || die "Can't open $in: $!\n";
open IN, $in_cmd;

while (<IN>) {
$gff = 2 if /^##gff-version 2/;
$gff = 3 if /^##gff-version 3/;
next if /^#/ && $gff;

s/\s+$//;
# 0-chr 1-src 2-feat 3-beg 4-end 5-scor 6-dir 7-fram 8-attr
my @f = split /\t/;
if ($gff) {
# most ver 2's stick gene names in the id field
($id) = $f[8]=~ /\bID="([^"]+)"/;
# most ver 3's stick unquoted names in the name field
($id) = $f[8]=~ /\bName=([^";]+)/ if !$id && $gff == 3;
} else {
($id) = $f[8]=~ /transcript_id "([^"]+)"/;
}

next unless $id && $f[0];

if ($f[2] eq 'exon') {
die "no position at exon on line $." if ! $f[3];
# gff3 puts :\d in exons sometimes
$id =~ s/:\d+$// if $gff == 3;
push @{$exons{$id}}, \@f;
# save lowest start
$trans{$id} = \@f if !$trans{$id};
} elsif ($f[2] eq 'start_codon') {
#optional, output codon start/stop as "thick" region in bed
$sc{$id}->[0] = $f[3];
} elsif ($f[2] eq 'stop_codon') {
$sc{$id}->[1] = $f[4];
} elsif ($f[2] eq 'miRNA' ) {
$trans{$id} = \@f if !$trans{$id};
push @{$exons{$id}}, \@f;
}
}

for $id (
# sort by chr then pos
sort {
$trans{$a}->[0] eq $trans{$b}->[0] ?
$trans{$a}->[3] <=> $trans{$b}->[3] :
$trans{$a}->[0] cmp $trans{$b}->[0]
} (keys(%trans)) ) {
my ($chr, undef, undef, undef, undef, undef, $dir, undef, $attr, undef, $cds, $cde) = @{$trans{$id}};
my ($cds, $cde);
($cds, $cde) = @{$sc{$id}} if $sc{$id};

# sort by pos
my @ex = sort {
$a->[3] <=> $b->[3]
} @{$exons{$id}};

my $beg = $ex[0][3];
my $end = $ex[-1][4];

if ($dir eq '-') {
# swap
$tmp=$cds;
$cds=$cde;
$cde=$tmp;
$cds -= 2 if $cds;
$cde += 2 if $cde;
}

# not specified, just use exons
$cds = $beg if !$cds;
$cde = $end if !$cde;

# adjust start for bed
--$beg; --$cds;

my $exn = @ex; # exon count
my $exst = join ",", map {$_->[3]-$beg-1} @ex; # exon start
my $exsz = join ",", map {$_->[4]-$_->[3]+1} @ex; # exon size

my $gene_id;
my $extend = "";
if ($extended) {
($gene_id) = $attr =~ /gene_name "([^"]+)"/;
($gene_id) = $attr =~ /gene_id "([^"]+)"/ unless $gene_id;
$extend="\t$gene_id";
}
# added an extra comma to make it look exactly like ucsc's beds
print "$chr\t$beg\t$end\t$id\t0\t$dir\t$cds\t$cde\t0\t$exn\t$exsz,\t$exst,$extend\n";
}


close IN;
3 changes: 3 additions & 0 deletions conf/base.config
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,9 @@ process {
withName:GetChromSizes {
container = 'quay.io/biocontainers/samtools:1.9--h10a08f8_12'
}
withName:GTFToBED {
container = 'quay.io/biocontainers/perl:5.22.0.1--0'
}
withName:MiniMap2Index {
container = 'quay.io/biocontainers/minimap2:2.17--h8b12597_1'
}
Expand Down
Loading