# Genomics Data Simulation to a Machine-Learning-Ready Table

Jupyter Notebook is a convenient environment for data scientists working on genomics data analysis. This notebook walks through the path from simulating paired-end FASTQ files to producing a table ready for downstream analysis, using `ART`, `Cromwell on Azure`, `GATK`, and `Picard`.

**This notebook covers:**

**1.** Simulate Next Generation Sequencing Data with ART

**2.** Convert fastq paired-end data to uBAM with Cromwell on Azure 

**3.** uBAM to gVCF with Cromwell on Azure

    3.1. Alignment and Variant Calling with Microsoft Genomics service (optional)

**4.** Convert the gVCF file to a table format



**Dependencies:**

This notebook requires the following tools and data:

- Azure CLI 

- AzCopy: Please install the latest release of `AzCopy`: https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10

- Cromwell on Azure: Please download the latest release of `CoA` from: https://github.com/microsoft/CromwellOnAzure/releases

- ART: ART is a set of simulation tools to generate synthetic next-generation sequencing reads. Please download the latest version of this tool from:  
https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm

- Picard: Please download the latest release of the tool from https://broadinstitute.github.io/picard/

- Genome Analysis Toolkit (GATK): Please download `GATK` from the Broad Institute's releases page into the same compute environment as this notebook: https://github.com/broadinstitute/gatk/releases

- Reference genome: To run this notebook in your own environment, you need a reference genome such as [hg19.fasta](https://azure.microsoft.com/en-us/services/open-datasets/catalog/genomics-reference-genomes/)

**Important information: This notebook uses the Python 3.6 kernel.**


# 1. Simulate Next Generation Sequencing Data with ART - **Sample Code**



We recommend [ART](https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm) for NGS data simulation. Quoting the ART website: "_ART is a set of simulation tools to generate synthetic next-generation sequencing reads. ART simulates sequencing reads by mimicking real sequencing process with empirical error models or quality profiles summarized from large recalibrated sequencing data._"

ART can simulate NGS data for different sequencing platforms. Because its error models and quality profiles are derived from real sequencing data, the simulated data sets are intended to resemble real genomics data sets closely, so users can test their own downstream analyses on them.

In this notebook, we demonstrate paired-end FASTQ simulation with sample code. Please visit the tool's website for further examples.

Please download the ART binary files from this [link](https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm), then run the command below.

Based on the [ART](https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm) manual, the simulation parameters are defined as follows:


        -ss  --seqSys   The name of Illumina sequencing system of the built-in profile used for simulation
        
        -i   --in       the filename of input DNA/RNA reference
        
        -p   --paired   indicate a paired-end read simulation or to generate reads from both ends of amplicons
                        NOTE: art will automatically switch to a mate-pair simulation if the given mean fragment size >= 2000
                        
        -l   --len      the length of reads to be simulated
          
        -f   --fcov     the fold of read coverage to be simulated or number of reads/read pairs generated for each amplicon
                        
        -m   --mflen    the mean size of DNA/RNA fragments for paired-end simulations
        
        -s   --sdev     the standard deviation of DNA/RNA fragment size for paired-end simulations.
        
        -o   --out      the prefix of output filename

        -sam --samout   indicate to generate SAM alignment file
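
As a rough guide to how `-l` and `-f` interact, the sketch below estimates the number of read pairs needed for a given fold coverage. It assumes coverage is simply total sequenced bases divided by reference length; ART's own count may differ (for example, because of masked `N` regions). The function name and the 1 Mb reference length are illustrative, not part of ART.

```python
def estimate_read_pairs(reference_length, fold_coverage, read_length):
    """Approximate number of read pairs needed to reach the requested coverage.

    Coverage = total sequenced bases / reference length, and each pair
    contributes two reads of `read_length` bases.
    """
    if reference_length <= 0 or read_length <= 0 or fold_coverage < 0:
        raise ValueError("lengths must be positive and coverage non-negative")
    total_bases = reference_length * fold_coverage
    return int(total_bases // (2 * read_length))


# Hypothetical 1 Mb reference with the settings used in this notebook (-l 150 -f 20).
pairs = estimate_read_pairs(1_000_000, 20, 150)
print(pairs)  # 66666
```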



    !./art_illumina -ss HS25 -sam -i hg19.fasta -p -l 150 -f 20 -m 200 -s 10 -o paired_dat
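
If you prefer to drive ART from Python rather than from a shell cell, a minimal sketch is to build the argument list from named parameters. The helper name `build_art_command` and its defaults are assumptions chosen to reproduce the command above; the flags themselves are the ones listed in the parameter table.

```python
import shlex


def build_art_command(reference, out_prefix, seq_sys="HS25", read_len=150,
                      fold_cov=20, mean_frag=200, sd_frag=10, sam=True,
                      art_path="./art_illumina"):
    """Return the art_illumina argument list for a paired-end simulation."""
    cmd = [art_path, "-ss", seq_sys]
    if sam:
        cmd.append("-sam")  # also write a SAM alignment file
    cmd += ["-i", reference, "-p",
            "-l", str(read_len), "-f", str(fold_cov),
            "-m", str(mean_frag), "-s", str(sd_frag),
            "-o", out_prefix]
    return cmd


cmd = build_art_command("hg19.fasta", "paired_dat")
print(" ".join(shlex.quote(part) for part in cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues if the reference path contains spaces.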

The command prints a report similar to the following:

                  Paired-end sequencing simulation

    ** Parameters used during run **
        Read Length:    150
        Genome masking 'N' cutoff frequency:    1 in 150
        Fold Coverage:            20X
        Mean Fragment Length:     200
        Standard Deviation:       10
        Profile Type:             Combined
        ID Tag:                   

    ** Quality Profile(s) **
        First Read:   HiSeq 2500 Length 150 R1 (built-in profile) 
        First Read:   HiSeq 2500 Length 150 R2 (built-in profile) 

    ** Output files **

      FASTQ Sequence Files: ~ 57.7 GB
         the 1st reads: paired_dat1.fq
         the 2nd reads: paired_dat2.fq 

      ALN Alignment Files: ~ 60.4 GB
         the 1st reads: paired_dat1.aln
         the 2nd reads: paired_dat2.aln

      SAM Alignment File: ~ 129.2 GB
        paired_dat.sam 

Reference: [ART](https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm)
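
Before moving on, it can be worth checking that the two FASTQ files are consistent. The sketch below counts records and verifies that mates appear in the same order. It assumes plain 4-line FASTQ records and that mates are named `<id>/1` and `<id>/2`; the function names and the tiny in-memory example are illustrative only.

```python
import io
from itertools import zip_longest


def read_fastq_ids(handle):
    """Yield the read ID (header without the leading '@') of each 4-line record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(seq.strip()) != len(qual.strip()):
            raise ValueError("sequence/quality length mismatch for: " + header.strip())
        yield header[1:].strip()


def count_matched_pairs(handle1, handle2):
    """Count read pairs, raising if the files disagree in length or mate order."""
    count = 0
    for id1, id2 in zip_longest(read_fastq_ids(handle1), read_fastq_ids(handle2)):
        if id1 is None or id2 is None:
            raise ValueError("files contain different numbers of reads")
        # Assumption: mates share the ID before the trailing '/1' or '/2'.
        if id1.rsplit("/", 1)[0] != id2.rsplit("/", 1)[0]:
            raise ValueError("mate mismatch: %s vs %s" % (id1, id2))
        count += 1
    return count


# Tiny hypothetical example; for real data use open("paired_dat1.fq") etc.
r1 = io.StringIO("@chr1-2/1\nACGT\n+\nIIII\n@chr1-1/1\nGGCC\n+\nIIII\n")
r2 = io.StringIO("@chr1-2/2\nTTGA\n+\nIIII\n@chr1-1/2\nCCAA\n+\nIIII\n")
print(count_matched_pairs(r1, r2))  # 2
```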



# 2. Convert fastq paired-end data to uBAM with Cromwell on Azure 

Users need to use the ["Sequence data format conversion pipelines on Azure"](https://github.com/microsoft/CromwellOnAzure/blob/master/docs/example-fastq-to-ubam.md/#Example-workflow-to-convert-FASTQ-files-to-uBAM-files) to convert the simulated FASTQ files to uBAM files. The pipeline is summarized below.

### paired-fastq-to-unmapped-bam :
This WDL converts paired FASTQ to uBAM and adds read group information 

#### Requirements/expectations 
- Paired-end sequencing data in FASTQ format (one file per orientation)
- The following metadata descriptors per sample: 
  - readgroup   
  - sample_name
  - library_name
  - platform_unit
  - run_date
  - platform_name
  - sequencing_center
  
#### Outputs 
- Unmapped BAM 
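
The metadata descriptors above are typically supplied to the workflow through an inputs JSON file. The following is a minimal sketch of assembling and validating such a file. The workflow name `FastqToUbam`, the `fastq_1`/`fastq_2` key names, and all values are hypothetical placeholders; take the actual input names from the WDL you run.

```python
import json

# Descriptors listed in the requirements above.
REQUIRED_FIELDS = ("readgroup", "sample_name", "library_name", "platform_unit",
                   "run_date", "platform_name", "sequencing_center")


def build_inputs(workflow_name, fastq_1, fastq_2, metadata):
    """Return a WDL-style inputs dict, failing early if a descriptor is missing."""
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    if missing:
        raise ValueError("missing metadata descriptors: " + ", ".join(missing))
    inputs = {
        workflow_name + ".fastq_1": fastq_1,  # hypothetical key name
        workflow_name + ".fastq_2": fastq_2,  # hypothetical key name
    }
    for field in REQUIRED_FIELDS:
        inputs[workflow_name + "." + field] = metadata[field]
    return inputs


# Hypothetical metadata for the simulated sample.
metadata = {
    "readgroup": "sim_rg1",
    "sample_name": "sim_sample",
    "library_name": "sim_lib",
    "platform_unit": "sim_unit",
    "run_date": "2020-01-01T00:00:00+0000",
    "platform_name": "illumina",
    "sequencing_center": "simulated",
}
inputs = build_inputs("FastqToUbam", "paired_dat1.fq", "paired_dat2.fq", metadata)
print(json.dumps(inputs, indent=2))
```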

# 3. uBAM to gVCF with Cromwell on Azure


The next step in this notebook is variant calling, which prepares the data for further downstream analysis. We recommend the GATK4 genome processing pipeline for this phase. Users can use the existing pipelines from ["gatk4-genome-processing-pipeline-azure"](https://github.com/microsoft/gatk4-genome-processing-pipeline-azure/blob/master-azure/README.md). The inputs and outputs of this pipeline are listed below.

## gatk4-genome-processing-pipeline
Workflows used for germline processing in whole genome sequence data.

### WholeGenomeGermlineSingleSample :
This WDL pipeline implements data pre-processing and initial variant calling (GVCF
generation) according to the GATK Best Practices (June 2016) for germline SNP and
Indel discovery in human whole-genome sequencing data.

#### Requirements/expectations
- Human whole-genome paired-end sequencing data in unmapped BAM (uBAM) format
- One or more read groups, one per uBAM file, all belonging to a single sample (SM)
- Input uBAM files must additionally comply with the following requirements:
  - filenames all have the same suffix (we use ".unmapped.bam")
  - files must pass validation by ValidateSamFile
  - reads are provided in query-sorted order
  - all reads must have an RG tag
- Reference genome must be Hg38 with ALT contigs
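
Two of these requirements are easy to check up front. The sketch below flags filenames that do not share the expected suffix and builds (without running) a Picard `ValidateSamFile` call in summary mode. The helper names and the `picard.jar` location are assumptions; adjust them to your environment.

```python
def check_ubam_names(filenames, suffix=".unmapped.bam"):
    """Return the filenames that do not end with the shared suffix."""
    return [name for name in filenames if not name.endswith(suffix)]


def validate_sam_command(ubam_path, picard_jar="picard.jar"):
    """Build a Picard ValidateSamFile command line for one uBAM file."""
    return ["java", "-jar", picard_jar, "ValidateSamFile",
            "I=" + ubam_path, "MODE=SUMMARY"]


# Hypothetical file names.
files = ["sim_rg1.unmapped.bam", "sim_rg2.bam"]
print(check_ubam_names(files))  # ['sim_rg2.bam']
print(" ".join(validate_sam_command(files[0])))
```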

#### Outputs 
- Cram, cram index, and cram md5 
- GVCF and its gvcf index 
- BQSR Report
- Several Summary Metrics 

## 3.1. Alignment and Variant Calling with Microsoft Genomics service (Optional)

Users can also use the [Microsoft Genomics service](https://azure.microsoft.com/en-us/services/genomics/) for alignment and variant calling. The Microsoft Genomics client (msgen) is a Python front end to the web service. It can be installed like a standard Python package, on Windows or Linux, using the pip package manager (`pip install msgen`). For each genome sample that you want to process, you create a configuration file containing all the parameters for downloading the data, running the Microsoft Genomics pipeline, and uploading the results:

- Your subscription key to Microsoft Genomics

- The process to run and its parameters

- Path information and storage account keys for the input files in either paired FASTQ, paired compressed FASTQ, or BAM format, in Azure Storage

- Path information and storage account key for the location to place the output files in Azure Storage
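
As an illustration, the items above could be written to `config.txt` from Python as simple `key: value` lines. This is only a sketch: the key names below are recalled from the msgen client documentation and should be treated as assumptions to verify there, and every value is a placeholder to replace with your own.

```python
def render_config(settings):
    """Render (key, value) pairs as 'key: value' lines, one per setting."""
    return "".join("%s: %s\n" % (key, value) for key, value in settings)


# Key names are assumptions to check against the msgen documentation;
# all values are placeholders.
settings = [
    ("api_url_base", "<service endpoint for your region>"),
    ("access_key", "<Microsoft Genomics subscription key>"),
    ("process_name", "<process to run>"),
    ("process_args", "<process parameters>"),
    ("input_storage_account_name", "<input storage account>"),
    ("input_storage_account_key", "<input storage key>"),
    ("input_storage_account_container", "<input container>"),
    ("output_storage_account_name", "<output storage account>"),
    ("output_storage_account_key", "<output storage key>"),
    ("output_storage_account_container", "<output container>"),
]

config_text = render_config(settings)
print(config_text)
# To save it: open("config.txt", "w").write(config_text)
```

Keep the resulting file out of version control, since it contains account keys.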

You can then invoke the msgen client to initiate processing and monitor progress until the job is complete. The final aligned reads in BAM format and the variant calls in VCF.GZ format will be placed in your designated output container in Azure Storage. The client can easily be incorporated into existing workflows. The sample command below submits a job to the Microsoft Genomics service with the Python client.

Please visit the [quick start run](https://docs.microsoft.com/en-us/azure/genomics/quickstart-run-genomics-workflow-portal) page for a sample job submission to the service.


    !./msgen submit -f ./config.txt -b1 paired_dat1.fq -b2 paired_dat2.fq

# 4. Convert the final gVCF file to a table format with VariantsToTable

The optional final step before downstream analysis is to convert the gVCF file into a table containing only the fields you need.

The GATK `VariantsToTable` tool extracts specified fields for each variant in a VCF file to a tab-delimited table, which may be easier to work with than a VCF. By default, the tool only extracts PASS or . (unfiltered) variants in the VCF file. Filtered variants may be included in the output by adding the `--show-filtered` flag. The tool can extract both INFO (i.e. site-level) fields and FORMAT (i.e. sample-level) fields.

Reference: [Variants to table](https://gatk.broadinstitute.org/hc/en-us/articles/360036882811-VariantsToTable)
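
The default filtering rule described above can be sketched in a few lines. This is an illustration of the rule as stated, not GATK's implementation; the function names and the example records are hypothetical.

```python
def passes_default_filter(filter_value):
    """True for PASS or '.' (unfiltered), the records kept by default."""
    return filter_value in ("PASS", ".")


def select_records(records, show_filtered=False):
    """Keep unfiltered records, or everything when show_filtered is set."""
    return [rec for rec in records
            if show_filtered or passes_default_filter(rec["FILTER"])]


# Hypothetical records with only the columns needed for the example.
records = [
    {"POS": 100, "FILTER": "PASS"},
    {"POS": 200, "FILTER": "."},
    {"POS": 300, "FILTER": "LowQual"},
]
print([rec["POS"] for rec in select_records(records)])        # [100, 200]
print([rec["POS"] for rec in select_records(records, True)])  # [100, 200, 300]
```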


**INFO/site-level fields**

Use the `-F` argument to extract INFO fields; each field will occupy a single column in the output file. The field can be any standard VCF column (e.g. CHROM, ID, QUAL) or any annotation name in the INFO field (e.g. AC, AF). The tool also supports the following additional fields:

- EVENTLENGTH (length of the event)
- TRANSITION (1 for a bi-allelic transition (SNP), 0 for bi-allelic transversion (SNP), -1 for INDELs and multi-allelics)
- HET (count of het genotypes)
- HOM-REF (count of homozygous reference genotypes)
- HOM-VAR (count of homozygous variant genotypes)
- NO-CALL (count of no-call genotypes)
- TYPE (type of variant, possible values are NO_VARIATION, SNP, MNP, INDEL, SYMBOLIC, and MIXED)
- VAR (count of non-reference genotypes)
- NSAMPLES (number of samples)
- NCALLED (number of called samples)
- MULTI-ALLELIC (is this variant multi-allelic? true/false)
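
To make the TRANSITION encoding concrete, the sketch below classifies a site from its REF and ALT alleles using the usual definition: a transition swaps two purines (A, G) or two pyrimidines (C, T), and any other single-base substitution is a transversion. It illustrates the definition only and is not GATK's code; the function name is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_flag(ref, alts):
    """Return 1 for a bi-allelic transition, 0 for a bi-allelic transversion,
    and -1 for anything else (indels, multi-allelic or symbolic sites)."""
    if len(alts) != 1:
        return -1  # multi-allelic (or no ALT allele)
    ref, alt = ref.upper(), alts[0].upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        return -1  # not a single-base substitution
    pair = {ref, alt}
    return 1 if pair <= PURINES or pair <= PYRIMIDINES else 0


print(transition_flag("A", ["G"]))       # 1
print(transition_flag("A", ["C"]))       # 0
print(transition_flag("A", ["AT"]))      # -1
print(transition_flag("C", ["T", "G"]))  # -1
```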


**FORMAT/sample-level fields**

Use the `-GF` argument to extract FORMAT/sample-level fields. The tool will create a new column per sample with the name "SAMPLE_NAME.FORMAT_FIELD_NAME" e.g. NA12877.GQ, NA12878.GQ.
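
The naming rule for sample-level columns can be sketched as follows. The helper name is hypothetical, and the column ordering shown (site-level fields first, then one column per sample and field) is an assumption; check the header of your own output table.

```python
def expected_columns(site_fields, genotype_fields, samples):
    """List the column names implied by the -F and -GF arguments."""
    columns = list(site_fields)
    for sample in samples:
        for field in genotype_fields:
            columns.append(sample + "." + field)  # SAMPLE_NAME.FORMAT_FIELD_NAME
    return columns


print(expected_columns(["CHROM", "POS"], ["GQ"], ["NA12877", "NA12878"]))
# ['CHROM', 'POS', 'NA12877.GQ', 'NA12878.GQ']
```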



**Input**

A VCF file to convert to a table

**Output**

A tab-delimited file containing the values of the requested fields in the VCF file.


    !./gatk VariantsToTable -V simoutput.g.vcf.gz -F CHROM -F POS -F TYPE -F AC -F AD -F AF -GF DP -GF AD -O outputtable.table
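
Once the table exists, it can be loaded for downstream analysis with the standard library alone. The sketch below reads a tab-delimited table into a list of dictionaries and maps `NA` (the marker used for missing values) to `None`. The function name, the sample name `sim`, and the example rows are hypothetical.

```python
import csv
import io


def load_table(handle):
    """Read a tab-delimited table into a list of dicts, mapping 'NA' to None."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({key: (None if value == "NA" else value)
                     for key, value in row.items()})
    return rows


# Hypothetical content; for real data use open("outputtable.table", newline="").
example = (
    "CHROM\tPOS\tTYPE\tAC\tsim.DP\n"
    "chr1\t12345\tSNP\t1\t20\n"
    "chr1\t12400\tINDEL\tNA\t18\n"
)
rows = load_table(io.StringIO(example))
print(len(rows))      # 2
print(rows[1]["AC"])  # None
```

All values are read as strings, so numeric columns such as `POS` or `DP` still need to be converted before they are used as model features.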

# References

1. Cromwell on Azure: https://github.com/microsoft/CromwellOnAzure/releases

2. ART: https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm

3. Variants to table: https://gatk.broadinstitute.org/hc/en-us/articles/360036882811-VariantsToTable 

4. Picard: https://broadinstitute.github.io/picard/ 

5. AzCopy: https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
   


# NOTICES

THIS NOTEBOOK CONTAINS ONLY SAMPLE CODE. MICROSOFT DOES NOT CLAIM ANY OWNERSHIP OF THIS CODE OR THESE LIBRARIES. MICROSOFT PROVIDES THIS NOTEBOOK AND SAMPLE USE OF ART'S SIMULATION LIBRARIES ON AN “AS IS” BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THIS NOTEBOOK. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THIS NOTEBOOK.

**END OF NOTEBOOK**