kfdrc_alignment_wf.cwl
768 lines (717 loc) · 40.5 KB
cwlVersion: v1.0
class: Workflow
id: kfdrc-alignment-workflow
label: Kids First DRC Alignment and GATK HaplotypeCaller Workflow
doc: |
# Kids First Data Resource Center Alignment and GATK HaplotypeCaller Workflows
![data service logo](https://github.com/d3b-center/d3b-research-workflows/raw/master/doc/kfdrc-logo-sm.png)
The Kids First Data Resource Center Alignment and Haplotype Calling Workflow (bam/fastq-to-cram, gVCF optional) follows
Broad best practices outlined in [Data pre-processing for variant discovery.](https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165)
It uses bam/fastq input and aligns/re-aligns to a bwa-indexed reference fasta, version hg38. The resultant bam is de-dupped and
base score recalibrated. Contamination is calculated and a gVCF is optionally created using GATK4 vbeta.1-3.5 HaplotypeCaller. Outputs from
this workflow can be used later for further analysis in joint trio genotyping and subsequent refinement and de novo variant analysis. If you would like to run this workflow using the cavatica public app, a basic primer on running public apps can be found [here](https://www.notion.so/d3b/Starting-From-Scratch-Running-Cavatica-af5ebb78c38a4f3190e32e67b4ce12bb).
Alternatively, if you'd like to run it locally using `cwltool`, a basic primer on that can be found [here](https://www.notion.so/d3b/Starting-From-Scratch-Running-CWLtool-b8dbbde2dc7742e4aff290b0a878344d) and combined with app-specific info from the readme below.
This workflow is the current production workflow, equivalent to this [Cavatica public app](https://cavatica.sbgenomics.com/public/apps#cavatica/apps-publisher/kfdrc-alignment-workflow) and supersedes the [old workflow](https://github.com/kids-first/kf-alignment-workflow/tree/1.0.0) and [public app](https://cavatica.sbgenomics.com/public/apps#kids-first-drc/kids-first-drc-alignment-workflow/kfdrc-alignment-bam2cram2gvcf/); however outputs are considered equivalent.
## Input Agnostic Alignment Workflow
Workflow for the alignment or realignment of input BAMs, PE reads, and/or SE reads; conditionally generate gVCF and metrics.
This is an all-in-one workflow for handling any kind of read input: BAM inputs, PE reads
and mates inputs, SE reads inputs, or any combination of these. The workflow will naively attempt
to process these depending on what you tell it you have provided. The user informs the workflow of
which inputs to process using three boolean inputs: `run_bam_processing`, `run_pe_reads_processing`,
and `run_se_reads_processing`. Providing `true` values for these, as well as their corresponding inputs,
will result in those inputs being processed.
The second half of the workflow deals with optional gVCF creation and metrics collection.
This workflow is capable of collecting the metrics using the following boolean flags: `run_hs_metrics`,
`run_wgs_metrics`, and `run_agg_metrics`. To run these metrics, additional optional inputs must
also be provided: `wxs_bait_interval_list` and `wxs_target_interval_list` for HsMetrics,
`wgs_coverage_interval_list` for WgsMetrics. To generate the gVCF, set `run_gvcf_processing` to
`true` and provide the following optional files: `dbsnp_vcf`, `contamination_sites_bed`,
`contamination_sites_mu`, `contamination_sites_ud`, `wgs_calling_interval_list`, and
`wgs_evaluation_interval_list`. Additionally, the workflow is capable of performing a basic
evaluation of the X and Y sex chromosomes using idxstats. To activate this feature, set `run_sex_metrics`
to `true`; no additional inputs are required.
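For example, a minimal inputs fragment (file paths here are hypothetical placeholders) that enables WGS metrics, aggregate metrics, sex metrics, and gVCF creation while skipping the WXS-only HsMetrics might look like:
```json
{
  "run_hs_metrics": false,
  "run_wgs_metrics": true,
  "run_agg_metrics": true,
  "run_sex_metrics": true,
  "run_gvcf_processing": true,
  "wgs_coverage_interval_list": {
    "class": "File",
    "location": "/path/to/wgs_coverage_regions.hg38.interval_list"
  }
}
```
Remember that each enabled flag pulls in its own required optional inputs, as listed above.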
### Basic Info
- dockerfiles: https://github.com/d3b-center/bixtools
- tested with
- Seven Bridges Cavatica Platform: https://cavatica.sbgenomics.com/
- cwltool: https://github.com/common-workflow-language/cwltool/releases/tag/3.0.20200324120055
### References:
- https://console.cloud.google.com/storage/browser/broad-references/hg38/v0/
- kfdrc bucket: s3://kids-first-seq-data/broad-references/
- cavatica: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
### Inputs:
```yaml
# REQUIRED
reference_tar: { type: File, doc: "Tar file containing a reference fasta and, optionally, its complete set of associated indexes (samtools, bwa, and picard)" }
biospecimen_name: { type: string, doc: "String name of biospecimen" }
output_basename: { type: string, doc: "String to use as the base for output filenames" }
knownsites: { type: 'File[]', doc: "List of files containing known polymorphic sites used to exclude regions around known polymorphisms from analysis" }
knownsites_indexes: { type: 'File[]?', doc: "Corresponding indexes for the knownsites. File position in list must match with its corresponding VCF's position in the knownsites file list. For example, if the first file in the knownsites list is 1000G_omni2.5.hg38.vcf.gz then the first item in this list must be 1000G_omni2.5.hg38.vcf.gz.tbi. Optional, but will save time/cost on indexing." }
# REQUIRED for gVCF
dbsnp_vcf: { type: 'File?', doc: "dbSNP vcf file" }
dbsnp_idx: { type: 'File?', doc: "dbSNP vcf index file" }
contamination_sites_bed: { type: 'File?', doc: ".bed file for markers used in this analysis, format (chr\tpos-1\tpos\trefAllele\taltAllele)" }
contamination_sites_mu: { type: 'File?', doc: ".mu matrix file of genotype matrix" }
contamination_sites_ud: { type: 'File?', doc: ".UD matrix file from SVD result of genotype matrix" }
run_gvcf_processing: { type: boolean, doc: "gVCF will be generated. Requires: dbsnp_vcf, contamination_sites_bed, contamination_sites_mu, contamination_sites_ud, wgs_calling_interval_list, wgs_evaluation_interval_list" }
# ADJUST TO FIT INPUT READS TYPE(S)
input_bam_list: { type: 'File[]?', doc: "List of input BAM files" }
input_pe_reads_list: { type: 'File[]?', doc: "List of input R1 paired end fastq reads" }
input_pe_mates_list: { type: 'File[]?', doc: "List of input R2 paired end fastq reads" }
input_pe_rgs_list: { type: 'string[]?', doc: "List of RG strings to use in PE processing" }
input_se_reads_list: { type: 'File[]?', doc: "List of input single end fastq reads" }
input_se_rgs_list: { type: 'string[]?', doc: "List of RG strings to use in SE processing" }
run_bam_processing: { type: boolean, doc: "BAM processing will be run. Requires: input_bam_list" }
run_pe_reads_processing: { type: boolean, doc: "PE reads processing will be run. Requires: input_pe_reads_list, input_pe_mates_list, input_pe_rgs_list" }
run_se_reads_processing: { type: boolean, doc: "SE reads processing will be run. Requires: input_se_reads_list, input_se_rgs_list" }
# IF WGS or CREATE gVCF
wgs_calling_interval_list: { type: 'File?', doc: "WGS interval list used to aid scattering Haplotype caller" }
wgs_coverage_interval_list: { type: 'File?', doc: "An interval list file that contains the positions to restrict the wgs metrics assessment" }
wgs_evaluation_interval_list: { type: 'File?', doc: "Target intervals to restrict gVCF metric analysis (for VariantCallingMetrics)" }
# IF WXS
wxs_bait_interval_list: { type: 'File?', doc: "An interval list file that contains the locations of the WXS baits used (for HsMetrics)" }
wxs_target_interval_list: { type: 'File?', doc: "An interval list file that contains the locations of the WXS targets (for HsMetrics)" }
# ADJUST TO GENERATE METRICS
run_hs_metrics: { type: boolean, doc: "HsMetrics will be collected. Only recommended for WXS inputs. Requires: wxs_bait_interval_list, wxs_target_interval_list" }
run_wgs_metrics: { type: boolean, doc: "WgsMetrics will be collected. Only recommended for WGS inputs. Requires: wgs_coverage_interval_list" }
run_agg_metrics: { type: boolean, doc: "AlignmentSummaryMetrics, GcBiasMetrics, InsertSizeMetrics, QualityScoreDistribution, and SequencingArtifactMetrics will be collected. Recommended for both WXS and WGS inputs." }
run_sex_metrics: {type: boolean, doc: "idxstats will be collected and X/Y ratios calculated"}
# ADVANCED
min_alignment_score: { type: 'int?', default: 30, doc: "For BWA MEM, don't output alignments with score lower than INT. This option only affects output." }
```
### Outputs:
```yaml
cram: {type: File, outputSource: samtools_bam_to_cram/output, doc: "(Re)Aligned Reads File"}
gvcf: {type: 'File[]?', outputSource: generate_gvcf/gvcf, doc: "Genomic VCF generated from the realigned alignment file."}
verifybamid_output: {type: 'File[]?', outputSource: generate_gvcf/verifybamid_output, doc: "Output from VerifyBamID that is used to calculate contamination."}
bqsr_report: {type: File, outputSource: gatk_gatherbqsrreports/output, doc: "Recalibration report from BQSR."}
gvcf_calling_metrics: {type: ['null', {type: array, items: {type: array, items: File}}], outputSource: generate_gvcf/gvcf_calling_metrics, doc: "General metrics for gVCF calling quality."}
hs_metrics: {type: 'File[]?', outputSource: picard_collecthsmetrics/output, doc: "Picard CollectHsMetrics metrics for the analysis of target-capture sequencing experiments."}
wgs_metrics: {type: 'File[]?', outputSource: picard_collectwgsmetrics/output, doc: "Picard CollectWgsMetrics metrics for evaluating the performance of whole genome sequencing experiments."}
alignment_metrics: {type: 'File[]?', outputSource: picard_collectalignmentsummarymetrics/output, doc: "Picard CollectAlignmentSummaryMetrics high level metrics about the alignment of reads within a SAM file."}
gc_bias_detail: {type: 'File[]?', outputSource: picard_collectgcbiasmetrics/detail, doc: "Picard CollectGcBiasMetrics detailed metrics about reads that fall within windows of a certain GC bin on the reference genome."}
gc_bias_summary: {type: 'File[]?', outputSource: picard_collectgcbiasmetrics/summary, doc: "Picard CollectGcBiasMetrics high level metrics that capture how biased the coverage in a certain lane is."}
gc_bias_chart: {type: 'File[]?', outputSource: picard_collectgcbiasmetrics/chart, doc: "Picard CollectGcBiasMetrics plot of GC bias."}
insert_metrics: {type: 'File[]?', outputSource: picard_collectinsertsizemetrics/metrics, doc: "Picard CollectInsertSizeMetrics metrics about the insert size distribution of a paired-end library."}
insert_plot: {type: 'File[]?', outputSource: picard_collectinsertsizemetrics/plot, doc: "Picard CollectInsertSizeMetrics insert size distribution plotted."}
artifact_bait_bias_detail_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/bait_bias_detail_metrics, doc: "Picard CollectSequencingArtifactMetrics bait bias artifacts broken down by context."}
artifact_bait_bias_summary_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/bait_bias_summary_metrics, doc: "Picard CollectSequencingArtifactMetrics summary analysis of a single bait bias artifact."}
artifact_error_summary_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/error_summary_metrics, doc: "Picard CollectSequencingArtifactMetrics summary metrics as a roll up of the context-specific error rates, to provide global error rates per type of base substitution."}
artifact_pre_adapter_detail_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/pre_adapter_detail_metrics, doc: "Picard CollectSequencingArtifactMetrics pre-adapter artifacts broken down by context."}
artifact_pre_adapter_summary_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/pre_adapter_summary_metrics, doc: "Picard CollectSequencingArtifactMetrics summary analysis of a single pre-adapter artifact."}
qual_metrics: {type: 'File[]?', outputSource: picard_qualityscoredistribution/metrics, doc: "Quality metrics for the realigned CRAM."}
qual_chart: {type: 'File[]?', outputSource: picard_qualityscoredistribution/chart, doc: "Visualization of quality metrics."}
idxstats: {type: 'File?', outputSource: samtools_idxstats_xy_ratio/output, doc: "samtools idxstats of the realigned BAM file."}
xy_ratio: {type: 'File?', outputSource: samtools_idxstats_xy_ratio/ratio, doc: "Text file containing X and Y reads statistics generated from idxstats."}
```
#### Detailed Input Information:
The pipeline is built to handle three distinct input types:
1. BAMs
1. PE Fastqs
1. SE Fastqs
Additionally, the workflow supports these three in any combination. You can have PE Fastqs and BAMs,
PE Fastqs and SE Fastqs, BAMs and PE Fastqs and SE Fastqs, etc. Each of these three classes will be
processed and aligned separately and the resulting BWA-aligned bams will be merged into a final BAM
before performing steps like BQSR and metrics collection.
##### BAM Inputs
The BAM processing portion of the pipeline is the simplest when it comes to inputs. You may provide
a single BAM or many BAMs. The input for BAMs is a file list. In Cavatica or other GUI interfaces,
simply select the files you wish to process. For command line interfaces such as cwltool, your input
should look like the following.
```json
{
...,
"run_pe_reads_processing": false,
"run_se_reads_processing": false,
"run_bam_processing": true,
"input_bam_list": [
{
"class": "File",
"location": "/path/to/bam1.bam"
},
{
"class": "File",
"location": "/path/to/bam2.bam"
}
],
...
}
```
##### SE Fastq Inputs
SE fastq processing requires more input to build the jobs correctly. Rather than providing a single
list you must provide two lists: `input_se_reads_list` and `input_se_rgs_list`. The `input_se_reads_list`
is where you put the files and the `input_se_rgs_list` is where you put your desired BAM @RG headers for
each reads file. These two lists must be ordered and of equal length. By ordered, we mean that the
first item of the `input_se_rgs_list` will be used when aligning the first item of the `input_se_reads_list`.
IMPORTANT NOTE: When you are entering the rg names, you need to use a second escape `\` to the tab values `\t`
as seen below. When the string value is read in by a tool such as cwltool it will interpret a `\\t` input
as `\t` and a `\t` as the literal `<tab>` value which is not a valid entry for bwa mem.
If you are using Cavatica GUI, however, no extra escape is necessary. The GUI will add an extra
escape to any tab values you enter.
In Cavatica make sure to double check that everything is in the right order when you enter the inputs.
In command line interfaces such as cwltool, your input should look like the following.
```json
{
...,
"run_pe_reads_processing": false,
"run_se_reads_processing": true,
"run_bam_processing": false,
"input_se_reads_list": [
{
"class": "File",
"location": "/path/to/single1.fastq"
},
{
"class": "File",
"location": "/path/to/single2.fastq"
}
],
"input_se_rgs_list": [
"@RG\\tID:single1\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name",
"@RG\\tID:single2\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name"
],
...
}
```
Take particular note of how the first item in the rgs list is the metadata for the first item in the fastq list.
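The double escape described above can be sanity-checked with a quick Python snippet (illustrative only, not part of the workflow), showing how a JSON parser such as the one behind cwltool treats `\\t` versus `\t`:

```python
import json

# In the raw JSON text, "\\t" is an escaped backslash followed by "t";
# parsing yields the two literal characters backslash + "t", which is what
# bwa mem expects inside an @RG string. A bare "\t" in the JSON text
# parses to a real tab character instead.
double_escaped = json.loads('"@RG\\\\tID:single1\\\\tSM:sample_name"')
single_escaped = json.loads('"@RG\\tID:single1\\tSM:sample_name"')

print(double_escaped)  # @RG\tID:single1\tSM:sample_name (literal backslash-t)
assert "\\t" in double_escaped  # backslash + t survives parsing
assert "\t" in single_escaped   # bare \t became a real tab
```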
##### PE Fastq Inputs
PE Fastq processing inputs is exactly like SE Fastq processing but requires you to provide the paired mates
files for your input paired reads. Once again, when using Cavatica make sure your inputs are in the correct
order. In command line interfaces such as cwltool, your input should look like the following.
```json
{
...,
"run_pe_reads_processing": true,
"run_se_reads_processing": false,
"run_bam_processing": false,
"input_pe_reads_list": [
{
"class": "File",
"location": "/path/to/sample1_R1.fastq"
},
{
"class": "File",
"location": "/path/to/sample2_R1.fastq"
},
{
"class": "File",
"location": "/path/to/sample3_R1.fastq"
}
],
"input_pe_mates_list": [
{
"class": "File",
"location": "/path/to/sample1_R2.fastq"
},
{
"class": "File",
"location": "/path/to/sample2_R2.fastq"
},
{
"class": "File",
"location": "/path/to/sample3_R2.fastq"
}
],
"input_pe_rgs_list": [
"@RG\\tID:sample1\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name",
"@RG\\tID:sample2\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name",
"@RG\\tID:sample3\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name"
],
...
}
```
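The positional pairing described above behaves like a strict zip over the three lists; a short Python sketch (illustrative only, not workflow code) of what the workflow does when scattering:

```python
# Dotproduct scattering pairs the i-th reads file with the i-th mates file
# and the i-th RG string; unequal list lengths are an error, mirroring the
# workflow failure mode described in the docs.
reads = ["sample1_R1.fastq", "sample2_R1.fastq"]
mates = ["sample1_R2.fastq", "sample2_R2.fastq"]
rgs = ["@RG\\tID:sample1", "@RG\\tID:sample2"]

assert len(reads) == len(mates) == len(rgs), "lists must be the same length"
jobs = list(zip(reads, mates, rgs))
print(jobs[0])  # ('sample1_R1.fastq', 'sample1_R2.fastq', '@RG\\tID:sample1')
```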
##### Multiple Input Types
As mentioned above, these three input types can be added in any combination. If you wanted to add
all three your command line input would look like the following.
```json
{
...,
"run_pe_reads_processing": true,
"run_se_reads_processing": true,
"run_bam_processing": true,
"input_bam_list": [
{
"class": "File",
"location": "/path/to/bam1.bam"
},
{
"class": "File",
"location": "/path/to/bam2.bam"
}
],
"input_se_reads_list": [
{
"class": "File",
"location": "/path/to/single1.fastq"
},
{
"class": "File",
"location": "/path/to/single2.fastq"
}
],
"input_se_rgs_list": [
"@RG\\tID:single1\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name",
"@RG\\tID:single2\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name"
],
"input_pe_reads_list": [
{
"class": "File",
"location": "/path/to/sample1_R1.fastq"
},
{
"class": "File",
"location": "/path/to/sample2_R1.fastq"
},
{
"class": "File",
"location": "/path/to/sample3_R1.fastq"
}
],
"input_pe_mates_list": [
{
"class": "File",
"location": "/path/to/sample1_R2.fastq"
},
{
"class": "File",
"location": "/path/to/sample2_R2.fastq"
},
{
"class": "File",
"location": "/path/to/sample3_R2.fastq"
}
],
"input_pe_rgs_list": [
"@RG\\tID:sample1\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name",
"@RG\\tID:sample2\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name",
"@RG\\tID:sample3\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name"
],
...
}
```
#### Example Runtimes:
1. 120 GB WGS BAM with AggMetrics, WgsMetrics, and gVCF creation: 14 hours & $35
1. 120 GB WGS BAM only: 11 hours
1. 4x40 GB WGS FASTQ files with AggMetrics, WgsMetrics, and gVCF creation: 23 hours & $72
1. 4x40 GB WGS FASTQ files only: 18 hours
1. 4x9 GB WXS FASTQ files with AggMetrics and gVCF creation: 4 hours & $9
1. 4x9 GB WXS FASTQ files only: 3 hours
#### Caveats:
1. Duplicates are flagged in a process that is connected to bwa mem. The implication of this design
decision is that duplicates are flagged only on the inputs that are scattered into bwa.
Duplicates, therefore, are not being flagged at a library level and, for large BAM and FASTQ inputs,
duplicates are only being detected within a portion of the read group.
#### Tips for running:
1. For the fastq input file lists (PE or SE), make sure the lists are properly ordered. The items in
the arrays are processed based on their position. These lists are dotproduct scattered. This means
that the first file in `input_pe_reads_list` is run with the first file in `input_pe_mates_list`
and the first string in `input_pe_rgs_list`. This also means these arrays must be the same
length or the workflow will fail.
1. The reference_tar input must be a tar file containing the reference fasta along with its indexes.
The required indexes are `[.64.ann,.64.amb,.64.bwt,.64.pac,.64.sa,.dict,.fai]` and are generated by bwa, picard, and samtools.
Additionally, a `.64.alt` index is recommended.
1. If you are making your own bwa indexes make sure to use the `-6` flag to obtain the `.64` version of the
indexes. Indexes that do not match this naming schema will cause a failure in certain runner ecosystems.
1. Should you decide to create your own reference indexes and omit the ALT index file from the reference,
or if its naming structure mismatches the other indexes, then your alignments will be equivalent to the results you would
obtain if you run BWA-MEM with the -j option.
1. The following is an example of a complete reference tar input:
```
~ tar tf Homo_sapiens_assembly38.tgz
Homo_sapiens_assembly38.dict
Homo_sapiens_assembly38.fasta
Homo_sapiens_assembly38.fasta.64.alt
Homo_sapiens_assembly38.fasta.64.amb
Homo_sapiens_assembly38.fasta.64.ann
Homo_sapiens_assembly38.fasta.64.bwt
Homo_sapiens_assembly38.fasta.64.pac
Homo_sapiens_assembly38.fasta.64.sa
Homo_sapiens_assembly38.fasta.fai
```
1. For advanced usage, you can skip the knownsites indexing by providing the knownsites_indexes input.
This file list should contain the indexes for each of the files in your knownsites input. Please
note this list must be ordered such that the position of each index file in the
knownsites_indexes list corresponds with the position of the VCF file in the knownsites list
that it indexes. In the example input below you can see that the 1000G_omni2.5.hg38.vcf.gz.tbi
file is the fourth item in the knownsites_indexes list because the 1000G_omni2.5.hg38.vcf.gz file is the
fourth item in the knownsites list. Failure to order in this way will result in the pipeline
failing or generating erroneous files.
1. Turning off gVCF creation and metrics collection results in a minimal successful run.
1. Suggested reference inputs (available from the [Broad Resource Bundle](https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0)):
```yaml
contamination_sites_bed: Homo_sapiens_assembly38.contam.bed
contamination_sites_mu: Homo_sapiens_assembly38.contam.mu
contamination_sites_ud: Homo_sapiens_assembly38.contam.UD
dbsnp_vcf: Homo_sapiens_assembly38.dbsnp138.vcf
reference_tar: Homo_sapiens_assembly38.tgz
knownsites:
- Homo_sapiens_assembly38.known_indels.vcf.gz
- Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
- 1000G_phase1.snps.high_confidence.hg38.vcf.gz
- 1000G_omni2.5.hg38.vcf.gz
knownsites_indexes:
- Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
- Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
- 1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
- 1000G_omni2.5.hg38.vcf.gz.tbi
```
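The completeness of a reference tar can be checked before launching the workflow. Below is a small Python sketch (the required-suffix list comes from the tips above; the helper function itself is illustrative, not part of the workflow):

```python
import tarfile

# Required index suffixes per the tips above; a .64.alt index is
# recommended but optional (omitting it is equivalent to BWA-MEM -j).
REQUIRED_SUFFIXES = [".64.ann", ".64.amb", ".64.bwt", ".64.pac", ".64.sa", ".dict", ".fai"]

def missing_indexes(tar_path):
    """Return the required index suffixes absent from a reference tar."""
    with tarfile.open(tar_path) as tar:
        names = tar.getnames()
    return [s for s in REQUIRED_SUFFIXES if not any(n.endswith(s) for n in names)]
```

Run against the example `Homo_sapiens_assembly38.tgz` listing above, `missing_indexes` returns an empty list; any non-empty result names the indexes you still need to generate.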
requirements:
- class: ScatterFeatureRequirement
- class: MultipleInputFeatureRequirement
- class: SubworkflowFeatureRequirement
inputs:
input_bam_list: {type: 'File[]?', doc: "List of input BAM files"}
input_pe_reads_list: {type: 'File[]?', doc: "List of input R1 paired end fastq reads"}
input_pe_mates_list: {type: 'File[]?', doc: "List of input R2 paired end fastq reads"}
input_pe_rgs_list: {type: 'string[]?', doc: "List of RG strings to use in PE processing"}
input_se_reads_list: {type: 'File[]?', doc: "List of input single end fastq reads"}
input_se_rgs_list: {type: 'string[]?', doc: "List of RG strings to use in SE processing"}
reference_tar: {type: File, doc: "Tar file containing a reference fasta and, optionally,\
\ its complete set of associated indexes (samtools, bwa, and picard)", sbg:suggestedValue: {
class: File, path: 5f4ffff4e4b0370371c05153, name: Homo_sapiens_assembly38.tgz}}
biospecimen_name: {type: string, doc: "String name of biospecimen"}
output_basename: {type: string, doc: "String to use as the base for output filenames"}
dbsnp_vcf: {type: 'File?', doc: "dbSNP vcf file", sbg:suggestedValue: {class: File,
path: 6063901f357c3a53540ca84b, name: Homo_sapiens_assembly38.dbsnp138.vcf}}
dbsnp_idx: {type: 'File?', doc: "dbSNP vcf index file", sbg:suggestedValue: {class: File,
path: 6063901e357c3a53540ca834, name: Homo_sapiens_assembly38.dbsnp138.vcf.idx}}
knownsites: {type: 'File[]', doc: "List of files containing known polymorphic sites\
\ used to exclude regions around known polymorphisms from analysis", sbg:suggestedValue: [
{class: File, path: 6063901e357c3a53540ca835, name: 1000G_omni2.5.hg38.vcf.gz},
{class: File, path: 6063901c357c3a53540ca80f, name: 1000G_phase1.snps.high_confidence.hg38.vcf.gz},
{class: File, path: 60639017357c3a53540ca7d0, name: Homo_sapiens_assembly38.known_indels.vcf.gz},
{class: File, path: 6063901a357c3a53540ca7f3, name: Mills_and_1000G_gold_standard.indels.hg38.vcf.gz}]}
knownsites_indexes: {type: 'File[]?', doc: "Corresponding indexes for the knownsites.\
\ File position in list must match with its corresponding VCF's position in\
\ the knownsites file list. For example, if the first file in the knownsites\
\ list is 1000G_omni2.5.hg38.vcf.gz then the first item in this list must be\
\ 1000G_omni2.5.hg38.vcf.gz.tbi. Optional, but will save time/cost on indexing.",
sbg:suggestedValue: [{class: File, path: 60639016357c3a53540ca7b1, name: 1000G_omni2.5.hg38.vcf.gz.tbi},
{class: File, path: 6063901e357c3a53540ca845, name: 1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi},
{class: File, path: 6063901c357c3a53540ca80d, name: Homo_sapiens_assembly38.known_indels.vcf.gz.tbi},
{class: File, path: 6063901c357c3a53540ca806, name: Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi}]}
contamination_sites_bed: {type: 'File?', doc: ".bed file for markers used in this\
\ analysis,format(chr\tpos-1\tpos\trefAllele\taltAllele)", sbg:suggestedValue: {
class: File, path: 6063901e357c3a53540ca833, name: Homo_sapiens_assembly38.contam.bed}}
contamination_sites_mu: {type: 'File?', doc: ".mu matrix file of genotype matrix",
sbg:suggestedValue: {class: File, path: 60639017357c3a53540ca7cd, name: Homo_sapiens_assembly38.contam.mu}}
contamination_sites_ud: {type: 'File?', doc: ".UD matrix file from SVD result of\
\ genotype matrix", sbg:suggestedValue: {class: File, path: 6063901f357c3a53540ca84f,
name: Homo_sapiens_assembly38.contam.UD}}
wgs_calling_interval_list: {type: 'File?', doc: "WGS interval list used to aid scattering\
\ Haplotype caller", sbg:suggestedValue: {class: File, path: 60639018357c3a53540ca7df,
name: wgs_calling_regions.hg38.interval_list}}
wgs_coverage_interval_list: {type: 'File?', doc: "An interval list file that contains\
\ the positions to restrict the wgs metrics assessment", sbg:suggestedValue: {
class: File, path: 6063901c357c3a53540ca813, name: wgs_coverage_regions.hg38.interval_list}}
wgs_evaluation_interval_list: {type: 'File?', doc: "Target intervals to restrict\
\ gvcf metric analysis (for VariantCallingMetrics)", sbg:suggestedValue: {class: File,
path: 60639017357c3a53540ca7d3, name: wgs_evaluation_regions.hg38.interval_list}}
wxs_bait_interval_list: {type: 'File?', doc: "An interval list file that contains\
\ the locations of the WXS baits used (for HsMetrics)"}
wxs_target_interval_list: {type: 'File?', doc: "An interval list file that contains\
\ the locations of the WXS targets (for HsMetrics)"}
run_bam_processing: {type: boolean, doc: "BAM processing will be run. Requires:\
\ input_bam_list"}
run_pe_reads_processing: {type: boolean, doc: "PE reads processing will be run.\
\ Requires: input_pe_reads_list, input_pe_mates_list, input_pe_rgs_list"}
run_se_reads_processing: {type: boolean, doc: "SE reads processing will be run.\
\ Requires: input_se_reads_list, input_se_rgs_list"}
run_hs_metrics: {type: boolean, doc: "HsMetrics will be collected. Only recommended\
\ for WXS inputs. Requires: wxs_bait_interval_list, wxs_target_interval_list"}
run_wgs_metrics: {type: boolean, doc: "WgsMetrics will be collected. Only recommended\
\ for WGS inputs. Requires: wgs_coverage_interval_list"}
run_agg_metrics: {type: boolean, doc: "AlignmentSummaryMetrics, GcBiasMetrics, InsertSizeMetrics,\
\ QualityScoreDistribution, and SequencingArtifactMetrics will be collected.\
\ Recommended for both WXS and WGS inputs."}
run_sex_metrics: {type: boolean, doc: "idxstats will be collected and X/Y ratios\
\ calculated"}
run_gvcf_processing: {type: boolean, doc: "gVCF will be generated. Requires: dbsnp_vcf,\
\ contamination_sites_bed, contamination_sites_mu, contamination_sites_ud, wgs_calling_interval_list,\
\ wgs_evaluation_interval_list"}
min_alignment_score: {type: 'int?', default: 30, doc: "For BWA MEM, Don't output\
\ alignment with score lower than INT. This option only affects output."}
outputs:
cram: {type: File, outputSource: samtools_bam_to_cram/output, doc: "(Re)Aligned\
\ Reads File"}
gvcf: {type: 'File[]?', outputSource: generate_gvcf/gvcf, doc: "Genomic VCF generated\
\ from the realigned alignment file."}
verifybamid_output: {type: 'File[]?', outputSource: generate_gvcf/verifybamid_output,
doc: "Output from VerifyBamID that is used to calculate contamination."}
bqsr_report: {type: File, outputSource: gatk_gatherbqsrreports/output, doc: "Recalibration\
\ report from BQSR."}
gvcf_calling_metrics: {type: ['null', {type: array, items: {type: array, items: File}}],
outputSource: generate_gvcf/gvcf_calling_metrics, doc: "General metrics for gVCF\
\ calling quality."}
hs_metrics: {type: 'File[]?', outputSource: picard_collecthsmetrics/output, doc: "Picard\
\ CollectHsMetrics metrics for the analysis of target-capture sequencing experiments."}
wgs_metrics: {type: 'File[]?', outputSource: picard_collectwgsmetrics/output, doc: "Picard\
\ CollectWgsMetrics metrics for evaluating the performance of whole genome sequencing\
\ experiments."}
alignment_metrics: {type: 'File[]?', outputSource: picard_collectalignmentsummarymetrics/output,
doc: "Picard CollectAlignmentSummaryMetrics high level metrics about the alignment\
\ of reads within a SAM file."}
gc_bias_detail: {type: 'File[]?', outputSource: picard_collectgcbiasmetrics/detail,
doc: "Picard CollectGcBiasMetrics detailed metrics about reads that fall within\
\ windows of a certain GC bin on the reference genome."}
gc_bias_summary: {type: 'File[]?', outputSource: picard_collectgcbiasmetrics/summary,
doc: "Picard CollectGcBiasMetrics high level metrics that capture how biased the\
\ coverage in a certain lane is."}
gc_bias_chart: {type: 'File[]?', outputSource: picard_collectgcbiasmetrics/chart,
doc: "Picard CollectGcBiasMetrics plot of GC bias."}
insert_metrics: {type: 'File[]?', outputSource: picard_collectinsertsizemetrics/metrics,
doc: "Picard CollectInsertSizeMetrics metrics about the insert size distribution\
\ of a paired-end library."}
insert_plot: {type: 'File[]?', outputSource: picard_collectinsertsizemetrics/plot,
doc: "Picard CollectInsertSizeMetrics insert size distribution plotted."}
artifact_bait_bias_detail_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/bait_bias_detail_metrics,
doc: "Picard CollectSequencingArtifactMetrics bait bias artifacts broken down\
\ by context."}
artifact_bait_bias_summary_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/bait_bias_summary_metrics,
doc: "Picard CollectSequencingArtifactMetrics summary analysis of a single bait\
\ bias artifact."}
artifact_error_summary_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/error_summary_metrics,
doc: "Picard CollectSequencingArtifactMetrics summary metrics as a roll up of\
\ the context-specific error rates, to provide global error rates per type of\
\ base substitution."}
artifact_pre_adapter_detail_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/pre_adapter_detail_metrics,
doc: "Picard CollectSequencingArtifactMetrics pre-adapter artifacts broken down\
\ by context."}
artifact_pre_adapter_summary_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/pre_adapter_summary_metrics,
doc: "Picard CollectSequencingArtifactMetrics summary analysis of a single pre-adapter\
\ artifact."}
qual_metrics: {type: 'File[]?', outputSource: picard_qualityscoredistribution/metrics,
doc: "Quality metrics for the realigned CRAM."}
qual_chart: {type: 'File[]?', outputSource: picard_qualityscoredistribution/chart,
doc: "Visualization of quality metrics."}
idxstats: {type: 'File?', outputSource: samtools_idxstats_xy_ratio/output, doc: "samtools\
\ idxstats of the realigned BAM file."}
xy_ratio: {type: 'File?', outputSource: samtools_idxstats_xy_ratio/ratio, doc: "Text\
\ file containing X and Y reads statistics generated from idxstats."}
steps:
untar_reference:
run: ../tools/untar_indexed_reference.cwl
in:
reference_tar: reference_tar
out: [fasta, fai, dict, alt, amb, ann, bwt, pac, sa]
bundle_secondaries:
run: ../tools/bundle_secondaryfiles.cwl
in:
primary_file: untar_reference/fasta
secondary_files:
source: [untar_reference/fai, untar_reference/dict, untar_reference/alt, untar_reference/amb,
untar_reference/ann, untar_reference/bwt, untar_reference/pac, untar_reference/sa]
linkMerge: merge_flattened
out: [output]
index_knownsites:
run: ../tools/tabix_index.cwl
in:
input_file: knownsites
input_index: knownsites_indexes
scatter: [input_file, input_index]
scatterMethod: dotproduct
out: [output]
gatekeeper:
run: ../tools/gatekeeper.cwl
in:
run_bam_processing: run_bam_processing
run_pe_reads_processing: run_pe_reads_processing
run_se_reads_processing: run_se_reads_processing
run_hs_metrics: run_hs_metrics
run_wgs_metrics: run_wgs_metrics
run_agg_metrics: run_agg_metrics
run_gvcf_processing: run_gvcf_processing
out: [scatter_bams, scatter_pe_reads, scatter_se_reads, scatter_gvcf, scatter_hs_metrics,
scatter_wgs_metrics, scatter_agg_metrics]
process_bams:
run: ../subworkflows/kfdrc_process_bamlist.cwl
in:
input_bam_list: input_bam_list
indexed_reference_fasta: bundle_secondaries/output
sample_name: biospecimen_name
conditional_run: gatekeeper/scatter_bams
min_alignment_score: min_alignment_score
scatter: conditional_run
out: [unsorted_bams] # +2 nesting: File[][][]
process_pe_reads:
run: ../subworkflows/kfdrc_process_pe_readslist2.cwl
in:
indexed_reference_fasta: bundle_secondaries/output
input_pe_reads_list: input_pe_reads_list
input_pe_mates_list: input_pe_mates_list
input_pe_rgs_list: input_pe_rgs_list
conditional_run: gatekeeper/scatter_pe_reads
min_alignment_score: min_alignment_score
scatter: conditional_run
out: [unsorted_bams] # +0 nesting: File[]
process_se_reads:
run: ../subworkflows/kfdrc_process_se_readslist2.cwl
in:
indexed_reference_fasta: bundle_secondaries/output
input_se_reads_list: input_se_reads_list
input_se_rgs_list: input_se_rgs_list
conditional_run: gatekeeper/scatter_se_reads
min_alignment_score: min_alignment_score
scatter: conditional_run
out: [unsorted_bams] # +0 nesting: File[]
sambamba_merge:
hints:
- class: sbg:AWSInstanceType
value: c5.9xlarge;ebs-gp2;2048
run: ../tools/sambamba_merge_anylist.cwl
in:
bams:
source: [process_bams/unsorted_bams, process_pe_reads/unsorted_bams, process_se_reads/unsorted_bams]
linkMerge: merge_flattened # flattens all inputs to File[]
base_file_name: output_basename
out: [merged_bam]
sambamba_sort:
hints:
- class: sbg:AWSInstanceType
value: c5.9xlarge;ebs-gp2;2048
run: ../tools/sambamba_sort.cwl
in:
bam: sambamba_merge/merged_bam
base_file_name: output_basename
out: [sorted_bam]
python_createsequencegroups:
run: ../tools/python_createsequencegroups.cwl
in:
ref_dict: untar_reference/dict
out: [sequence_intervals, sequence_intervals_with_unmapped]
gatk_baserecalibrator:
run: ../tools/gatk_baserecalibrator.cwl
in:
input_bam: sambamba_sort/sorted_bam
knownsites: index_knownsites/output
reference: bundle_secondaries/output
sequence_interval: python_createsequencegroups/sequence_intervals
scatter: [sequence_interval]
out: [output]
gatk_gatherbqsrreports:
run: ../tools/gatk_gatherbqsrreports.cwl
in:
input_brsq_reports: gatk_baserecalibrator/output
output_basename: output_basename
out: [output]
gatk_applybqsr:
run: ../tools/gatk_applybqsr.cwl
in:
bqsr_report: gatk_gatherbqsrreports/output
input_bam: sambamba_sort/sorted_bam
reference: bundle_secondaries/output
sequence_interval: python_createsequencegroups/sequence_intervals_with_unmapped
scatter: [sequence_interval]
out: [recalibrated_bam]
picard_gatherbamfiles:
run: ../tools/picard_gatherbamfiles.cwl
in:
input_bam: gatk_applybqsr/recalibrated_bam
output_bam_basename: output_basename
out: [output]
samtools_bam_to_cram:
run: ../tools/samtools_bam_to_cram.cwl
in:
input_bam: picard_gatherbamfiles/output
reference: bundle_secondaries/output
out: [output]
picard_collecthsmetrics:
run: ../tools/picard_collecthsmetrics_conditional.cwl
in:
input_bam: picard_gatherbamfiles/output
bait_intervals: wxs_bait_interval_list
target_intervals: wxs_target_interval_list
reference: bundle_secondaries/output
conditional_run: gatekeeper/scatter_hs_metrics
scatter: conditional_run
out: [output]
picard_collectwgsmetrics:
run: ../tools/picard_collectwgsmetrics_conditional.cwl
in:
input_bam: picard_gatherbamfiles/output
intervals: wgs_coverage_interval_list
reference: bundle_secondaries/output
conditional_run: gatekeeper/scatter_wgs_metrics
scatter: conditional_run
out: [output]
picard_collectalignmentsummarymetrics:
run: ../tools/picard_collectalignmentsummarymetrics_conditional.cwl
in:
input_bam: picard_gatherbamfiles/output
reference: bundle_secondaries/output
conditional_run: gatekeeper/scatter_agg_metrics
scatter: conditional_run
out: [output]
picard_collectgcbiasmetrics:
run: ../tools/picard_collectgcbiasmetrics_conditional.cwl
in:
input_bam: picard_gatherbamfiles/output
reference: bundle_secondaries/output
conditional_run: gatekeeper/scatter_agg_metrics
scatter: conditional_run
out: [detail, summary, chart]
picard_collectinsertsizemetrics:
run: ../tools/picard_collectinsertsizemetrics_conditional.cwl
in:
input_bam: picard_gatherbamfiles/output
reference: bundle_secondaries/output
conditional_run: gatekeeper/scatter_agg_metrics
scatter: conditional_run
out: [metrics, plot]
picard_collectsequencingartifactmetrics:
run: ../tools/picard_collectsequencingartifactmetrics_conditional.cwl
in:
input_bam: picard_gatherbamfiles/output
reference: bundle_secondaries/output
conditional_run: gatekeeper/scatter_agg_metrics
scatter: conditional_run
out: [bait_bias_detail_metrics, bait_bias_summary_metrics, error_summary_metrics,
pre_adapter_detail_metrics, pre_adapter_summary_metrics]
picard_qualityscoredistribution:
run: ../tools/picard_qualityscoredistribution_conditional.cwl
in:
input_bam: picard_gatherbamfiles/output
reference: bundle_secondaries/output
conditional_run: gatekeeper/scatter_agg_metrics
scatter: conditional_run
out: [metrics, chart]
samtools_idxstats_xy_ratio:
run: ../tools/samtools_idxstats_xy_ratio.cwl
in:
run_idxstats: run_sex_metrics
input_bam: picard_gatherbamfiles/output
out: [output, ratio]
generate_gvcf:
run: ../subworkflows/kfdrc_bam_to_gvcf.cwl
in:
biospecimen_name: biospecimen_name
contamination_sites_bed: contamination_sites_bed
contamination_sites_mu: contamination_sites_mu
contamination_sites_ud: contamination_sites_ud
input_bam: picard_gatherbamfiles/output
indexed_reference_fasta: bundle_secondaries/output
output_basename: output_basename
dbsnp_vcf: dbsnp_vcf
dbsnp_idx: dbsnp_idx
reference_dict: untar_reference/dict
wgs_calling_interval_list: wgs_calling_interval_list
wgs_evaluation_interval_list: wgs_evaluation_interval_list
conditional_run: gatekeeper/scatter_gvcf
scatter: conditional_run
out: [verifybamid_output, gvcf, gvcf_calling_metrics]
$namespaces:
sbg: https://sevenbridges.com
hints:
- class: 'sbg:maxNumberOfParallelInstances'
value: 6
sbg:license: Apache License 2.0
sbg:publisher: KFDRC
sbg:categories:
- ALIGNMENT
- DNA
- WGS
- WXS
- GVCF
sbg:links:
- id: 'https://github.com/kids-first/kf-alignment-workflow/releases/tag/v2.7.3'
label: github-release