hou/NGS

Some tools for NGS studies

  • NGS_pipeline.sh
Usage: NGS_pipeline.sh <BAM_file/FASTQ_Files>
  For example:
    NGS_pipeline.sh id1.bam
  OR:
    NGS_pipeline.sh id1_R1.fastq.gz id1_R2.fastq.gz

This script implements the GATK Best Practices workflow (GATK v3.5).

If the input is a BAM file, it is first converted back to FASTQ files, and then the following steps are run:
 1) mapping (bwa mem)
 2) mark duplicates & sort (Picard)
 3) Indel realignment
 4) Base recalibration
 5) Variant calling (HaplotypeCaller GVCF mode)
If the input is a pair of FASTQ files, mapping starts right away.
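
The input handling described above can be sketched in Python as follows. This is only an illustration of the dispatch logic, not the code in NGS_pipeline.sh: the function name `plan_steps`, the accepted FASTQ suffixes, and the step labels are assumptions made for this example.

```python
from pathlib import Path

# Step labels follow the list in this README; everything else is hypothetical.
MAPPING_STEPS = [
    "mapping (bwa mem)",
    "mark duplicates & sort (Picard)",
    "indel realignment",
    "base recalibration",
    "variant calling (HaplotypeCaller GVCF mode)",
]

# Assumed FASTQ suffixes; the real script may accept a different set.
FASTQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def plan_steps(inputs):
    """Return the ordered steps for one BAM file or a pair of FASTQ files."""
    names = [Path(p).name.lower() for p in inputs]
    if len(names) == 1 and names[0].endswith(".bam"):
        # A BAM input is converted back to FASTQ before mapping.
        return ["convert BAM to FASTQ"] + MAPPING_STEPS
    if len(names) == 2 and all(n.endswith(FASTQ_SUFFIXES) for n in names):
        # A FASTQ pair goes straight to mapping.
        return list(MAPPING_STEPS)
    raise ValueError("expected one BAM file or two FASTQ files")


if __name__ == "__main__":
    for step in plan_steps(["id1.bam"]):
        print(step)
```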
  • fixFastq.py
usage: fixFastq.py [-h] [--checkEncodingOnly] fastq1 fastq2 output

Fix FASTQ files (remove singletons, resolve pairs, recode quality scores to
Illumina-1.8 if needed)

positional arguments:
  fastq1               input fastq file for first read in paired end data
  fastq2               input fastq file for second read in paired end data
  output               the prefix of output fastq files

optional arguments:
  -h, --help           show this help message and exit
  --checkEncodingOnly  Use this flag if you only want to check the encoding
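
As an illustration of what checking and recoding the quality encoding can involve, here is a minimal sketch. It is not taken from fixFastq.py: the function names and thresholds are assumptions, and the heuristic only distinguishes Phred+33 (Illumina 1.8 style) from Phred+64 in Illumina data.

```python
def guess_offset(quality_strings):
    """Guess the Phred offset (33 or 64) from a sample of quality strings.

    Returns None when the sample does not settle the question.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    if not codes:
        raise ValueError("no quality characters to inspect")
    if min(codes) < 59:
        # Offset-64 encodings never use characters below ';' (ASCII 59).
        return 33
    if max(codes) > 74:
        # Assumption: Phred+33 Illumina qualities stop at 'J' (Q41).
        return 64
    return None


def to_phred33(quality, offset):
    """Recode one quality string to Phred+33, assuming Phred (not Solexa) scores."""
    if offset == 33:
        return quality
    # Shifting by 64 - 33 = 31 keeps the same Phred score under the new offset.
    return "".join(chr(ord(c) - 31) for c in quality)


if __name__ == "__main__":
    sample = ["hhhhgfe^"]
    offset = guess_offset(sample)
    print(offset, to_phred33(sample[0], offset))
```

A real check would sample many reads before deciding, since a handful of high-quality reads can look the same under both offsets.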
  • vcfSummary.py
usage: vcfSummary.py [-h] input output

Get variant- and individual-level summary from a VCF file. For example: AC, AF, missing rate, ... for variants; NVAR, Ti/Tv, ... for subjects

positional arguments:
  input     The VCF input file
  output    The prefix of output files

optional arguments:
  -h, --help  show this help message and exit
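
To make the summary statistics concrete, the sketch below computes AC, AF, and missing rate for one biallelic site, plus a Ti/Tv ratio over a list of SNVs. It is a simplified illustration, not the implementation in vcfSummary.py; the function names and the treatment of partially missing genotypes are assumptions.

```python
# The four transition substitutions; every other SNV is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def variant_summary(genotypes):
    """AC, AF and missing rate for one biallelic site.

    `genotypes` is a non-empty list of GT strings such as '0/1' or './.'.
    A genotype with any missing allele is counted as missing (an assumption).
    """
    alt = called = missing = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            missing += 1
            continue
        called += len(alleles)
        alt += sum(a != "0" for a in alleles)
    af = alt / called if called else float("nan")
    return {"AC": alt, "AF": af, "missing_rate": missing / len(genotypes)}


def ti_tv(snvs):
    """Transition/transversion ratio for a list of (ref, alt) base pairs."""
    ti = sum((ref, alt) in TRANSITIONS for ref, alt in snvs)
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")


if __name__ == "__main__":
    print(variant_summary(["0/0", "0/1", "1/1", "./."]))
    print(ti_tv([("A", "G"), ("C", "T"), ("A", "C")]))
```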
  • vcfPedcheck.py
usage: vcfPedcheck.py [-h] [--zeroout] [--me N] vcf fam prefix

Scan the vcf file for Mendelian Errors

positional arguments:
  vcf            The VCF input file
  fam            The fam file
  prefix         The prefix of the output files

optional arguments:
  -h, --help     show this help message and exit
  --zeroout, -z  Create a new vcf file by zeroing out all Mendelian Errors
  --me N         Mark all variants with a Mendelian error rate > N (based on
                 trios) in the new vcf file
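
The core check behind a Mendelian error scan can be sketched as below. This is a minimal example for autosomal, diploid genotypes, not the code in vcfPedcheck.py; the function names are hypothetical, and the error-rate denominator (trios with all three genotypes called) is an assumption.

```python
from itertools import product


def parse_gt(gt):
    """Split a diploid GT string into two alleles; None if missing or not diploid."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return tuple(alleles)


def is_mendelian_error(child, father, mother):
    """True if the child genotype cannot arise from one allele of each parent."""
    c, f, m = parse_gt(child), parse_gt(father), parse_gt(mother)
    if None in (c, f, m):
        return False  # a trio with missing calls cannot be judged
    possible = {tuple(sorted(pair)) for pair in product(f, m)}
    return tuple(sorted(c)) not in possible


def mendelian_error_rate(trios):
    """Fraction of fully called (child, father, mother) trios with an error."""
    judged = [t for t in trios if None not in [parse_gt(g) for g in t]]
    if not judged:
        return 0.0
    return sum(is_mendelian_error(*t) for t in judged) / len(judged)


if __name__ == "__main__":
    trios = [("1/1", "0/0", "0/1"), ("0/1", "0/0", "1/1"), ("./.", "0/0", "0/0")]
    print(mendelian_error_rate(trios))
```

A site whose rate exceeds the `--me` threshold would then be marked, and with `--zeroout` the offending genotypes would be set to missing.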
  • SelectVariants.py
usage: SelectVariants.py [-h] [--genemodel {ensembl,refSeq}] [--maf N]
                          [--splicing] [--frameshift] [--nonsynonymous]
                          [--stopgain] [--stoploss]
                          input output

Select variants based on MAF and functional categories

positional arguments:
  input     input file generated by 'table_annovar.pl'
  output    output file with all selected variants included

optional arguments:
  -h, --help      show this help message and exit
  --genemodel {ensembl,refSeq}
                        which gene model to use (default: ensembl)
  --maf N               variants with MAF > N will be excluded (default: 0.01)
  --splicing            use this flag to select splicing variants
  --frameshift          use this flag to select frameshift-indels
  --nonsynonymous       use this flag to select non-synonymous variants
  --stopgain            use this flag to select stop-gain variants
  --stoploss            use this flag to select stop-loss variants
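
The filtering logic can be pictured roughly as follows. This sketch works on plain dictionaries rather than real 'table_annovar.pl' output: the keys `maf` and `function` are hypothetical stand-ins for the annotation columns, and treating "no category flag given" as "keep every category" is an assumption, not documented behavior.

```python
def select_variants(rows, max_maf=0.01, categories=frozenset()):
    """Keep rows with MAF <= max_maf whose category is in `categories`.

    `rows` is an iterable of dicts with hypothetical keys 'maf' and 'function'.
    An empty `categories` set applies no functional filter (an assumption).
    """
    selected = []
    for row in rows:
        if row["maf"] > max_maf:
            continue  # variants with MAF > N are excluded
        if categories and row["function"] not in categories:
            continue
        selected.append(row)
    return selected


if __name__ == "__main__":
    rows = [
        {"id": "v1", "maf": 0.001, "function": "stopgain"},
        {"id": "v2", "maf": 0.2, "function": "stopgain"},
        {"id": "v3", "maf": 0.005, "function": "synonymous"},
    ]
    # Roughly what '--maf 0.01 --stopgain --splicing' would express.
    for row in select_variants(rows, 0.01, {"stopgain", "splicing"}):
        print(row["id"])
```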
  • backfillVCF.py
usage: backfillVCF.py [-h] [--subjects] input output

Backfilling is needed when different VCF files have been merged. This script
goes through the merged VCF file and backfills the genotypes to '0/0' when all
subjects provided by the user have missing calls

positional arguments:
  input        Name of the input file (VCF format)
  output       Name of the output file

optional arguments:
  -h, --help   show this help message and exit
  --subjects   A file that lists the groups of subjects for whom
               backfilling is needed (this file can have multiple lines, but
               subjects that belong to the same group must be on a single
               line)
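
For a single site, the backfilling rule can be sketched like this. It is an illustration under stated assumptions rather than the code in backfillVCF.py: the function name, the set of strings treated as missing, and replacing the whole sample field with a bare '0/0' are all choices made for this example.

```python
# GT values treated as missing in this sketch (an assumption).
MISSING = {"./.", ".|.", "."}


def backfill(genotypes, samples, groups):
    """Return a copy of one site's sample fields with all-missing groups set to '0/0'.

    `genotypes` and `samples` are parallel lists; `groups` is a list of lists of
    sample names, one inner list per line of the subjects file.
    """
    index = {name: i for i, name in enumerate(samples)}
    filled = list(genotypes)
    for group in groups:
        cols = [index[s] for s in group if s in index]
        # Backfill only when every subject in the group has a missing call.
        if cols and all(genotypes[i].split(":")[0] in MISSING for i in cols):
            for i in cols:
                filled[i] = "0/0"
    return filled


if __name__ == "__main__":
    samples = ["a", "b", "c", "d"]
    groups = [["a", "b"], ["c", "d"]]
    print(backfill(["./.", "./.", "0/1", "./."], samples, groups))
```

In this example the first group is backfilled because both of its subjects are missing, while the second group is left untouched because one subject has a call.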
