
7 Usage and Options

Ezequiel L. Nicolazzi edited this page Oct 12, 2015 · 2 revisions

##ZANARDI USAGE

Once the parameter file is fully set, the only thing left to do is to run Zanardi from the command line, choosing the appropriate analyses. The general usage is as follows (a list of options can be found below):

python Zanardi.py **[options]**

To have a quick reminder of the options available, you can use an internal help routine:

python Zanardi.py -h

or

python Zanardi.py --help

Please remember that the user is expected to interact only with Zanardi's parameter file (PARAMFILE.txt), and its in/outputs.
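As an illustration of the general usage above, the following minimal sketch assembles a Zanardi command line from a list of analyses. The helper name `build_zanardi_command` is hypothetical and not part of Zanardi; the flags it emits (--plinkqc, --mds, --outdir, --tmpdir, --save) are the ones documented on this page.

```python
def build_zanardi_command(analyses, outdir=None, tmpdir=None, save=False):
    """Build the argument list for one Zanardi run (hypothetical helper).

    analyses: option names without leading dashes, e.g. ["plinkqc", "mds"].
    """
    cmd = ["python", "Zanardi.py"]
    # Analyses are passed in the order given by the caller.
    cmd += ["--" + name for name in analyses]
    if outdir is not None:
        cmd.append("--outdir=" + outdir)
    if tmpdir is not None:
        cmd.append("--tmpdir=" + tmpdir)
    if save:
        cmd.append("--save")
    return cmd


cmd = build_zanardi_command(["plinkqc", "mds"], outdir="./QC_RUN", save=True)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the folder that contains Zanardi.py.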

##OPTIONS

Except for a few cases, Zanardi can be run with a single option or with several options at once. When multiple options are chosen, Zanardi streamlines the work by using the output of the first step as the input of the second step, and so on.

Notation:

  • .(xxx) = A set of file extensions that depends on the software used
  • [<NAME>] = A user-defined name for a folder or filename suffix, provided on the command line (options --outdir and --tmpdir) or in the parameter file (OUTPUT_NAME variable), respectively.

List of currently available options

##FULL OPTION EXPLANATION AND REQUIREMENTS

###-h or --help options

  • MEANING:

    This option produces a quick reference to the list of options available on screen.

  • SOFTWARE USED:

    N/A

  • PARAMETER FILE OPTIONS:

    N/A

  • INPUT FILE(S) REQUIRED:

    N/A

  • OUTPUT FILE(S) PRODUCED:

    N/A

###--download option

  • MEANING:

    This is a stand-alone option: when invoked, Zanardi terminates after running it, even if other options are present. It is usually run very few times (possibly only once) to download the required software. The option calls a small bash script that automatically downloads the requested software, uncompresses/installs it and updates the corresponding path in the parameter file. One or several programs can be requested in a single call; for multiple downloads, provide a comma-separated list. Currently available values are: plink, beagle3, beagle4, fcgene and admixture. For example, the following command:

python Zanardi.py --download=beagle3,beagle4,fcGENE,PLINK,ADMIXTURE

will download the required software (all 5 programs), update the path for each program in the parameter file and quit Zanardi. Note that software names are not case sensitive.

  • SOFTWARE USED:

    Own code

  • PARAMETER FILE OPTIONS:

    N/A

  • INPUT FILE(S) REQUIRED:

    N/A

  • OUTPUT FILE(S) PRODUCED:

    Each downloaded program is placed in a separate folder (except for BEAGLE v.3 and v.4, which share the same “BEAGLE” folder) under the “UTILS” folder. Note that beagle .jar executables are automatically renamed to beagle3.jar and beagle4.jar.

    !!! WARNING !!! PLINK v1.9 is currently under development. This means the link provided with the current version of Zanardi may be (and most probably is) broken, in which case the download script won't work correctly. If this happens, don't panic. Go to the PLINK v1.9 download page, copy the link of the Linux 64-bit (STABLE) download button (right-click the "download" word and select "copy link" from the menu), and use that link to set the variable PLINK (currently, row 24) in UTILS/ZANARDI_UTILS/download_app.sh. This tedious procedure won't be necessary once PLINK v1.9 becomes PLINK v2.0.
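The comma-separated, case-insensitive list accepted by --download can be illustrated with a minimal sketch. This is not Zanardi's own code (which delegates the download to a bash script); the function name `parse_download_list` is hypothetical, and only the five software names listed above are taken from this page.

```python
AVAILABLE = {"plink", "beagle3", "beagle4", "fcgene", "admixture"}


def parse_download_list(value):
    """Split a --download value and normalise each name to lower case."""
    names = [name.strip().lower() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in AVAILABLE]
    if unknown:
        raise ValueError("Unknown software: " + ", ".join(unknown))
    return names


print(parse_download_list("beagle3,beagle4,fcGENE,PLINK,ADMIXTURE"))
```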

###--plinkqc option

  • MEANING:

    This option runs a quality control on the data after the merge step (if more than one genotype file is provided). It can also be used to streamline and integrate any PLINK functionality within Zanardi (advanced users only; see the QC_OTHOPT parameter variable).

  • SOFTWARE USED:

    PLINK v1.9 (Chang et al., 2015)

  • PARAMETER FILE OPTIONS:

    • QCCRATE_IND: call rate for individuals - values range from 0 to 1;
    • QCCRATE_SNP: call rate for SNPs - values range from 0 to 1;
    • QCMAF: Minor allele frequency - values range from 0 to 1;
    • QCHWE: Hardy-Weinberg Equilibrium - values range from 0 to 1;
    • QC_OTHOPT: apply any other PLINK option using plink syntax – See PLINK manual for further info
  • INPUT FILE(S) REQUIRED:

    Any genotype and map input file (INPUT_PED+INPUT_MAP and/or INPUT_705+INPUT_705_MAP)

  • OUTPUT FILE(S) PRODUCED:

    Quality controlled PLINK file – after merge (PLINK format): [<OUTDIR>]/PLINK_OUT_[<FILENAME_SUFFIX>].(ped/map/log/...)
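To make the per-SNP QC statistics concrete, here is a minimal sketch that computes the call rate and the minor allele frequency of a single SNP. It is only an illustration of the quantities: in Zanardi the filtering itself is done by PLINK, and how the parameter values are translated into PLINK thresholds is not described on this page. The function names and the genotype representation (a pair of allele codes per sample, with "0" for missing) are assumptions.

```python
def snp_call_rate(genotypes, missing="0"):
    """Fraction of samples with both alleles called at one SNP."""
    called = [g for g in genotypes if missing not in g]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes, missing="0"):
    """Frequency of the rarer allele among called genotypes (biallelic SNP)."""
    alleles = [a for g in genotypes if missing not in g for a in g]
    counts = {}
    for allele in alleles:
        counts[allele] = counts.get(allele, 0) + 1
    if len(counts) < 2:
        return 0.0  # monomorphic SNP, or no called genotypes at all
    return min(counts.values()) / len(alleles)


snp = [("A", "A"), ("A", "B"), ("B", "B"), ("0", "0"), ("A", "A")]
print(snp_call_rate(snp), minor_allele_frequency(snp))
```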

###--mds option

  • MEANING:

    This option produces a Multi-dimensional Scaling plot (MDS) over all genotype samples provided.

  • SOFTWARE USED:

    PLINK v1.9 (Chang et al., 2015); R (with ggplot2 package installed)

  • PARAMETER FILE OPTIONS:

    • MDSGROUPop: Group individuals based on the “FID” information of the PLINK file (i.e. the first column of the PED file, usually used to provide breed information) – Accepted values Y or N. If Y is chosen, MDS values from all individuals with the same FID are averaged together (resulting in a single “dot” for each FID). If N is chosen, all individuals are plotted, irrespective of their FID.
  • INPUT FILE(S) REQUIRED:

    Any genotype and map input file (INPUT_PED+INPUT_MAP and/or INPUT_705+INPUT_705_MAP)

  • OUTPUT FILE(S) PRODUCED:

    • PLINK MDS step (PLINK format): [<TEMPDIR>]/MDSPLOT.(ped/map/log)
    • MDS R script (plain text): [<TEMPDIR>]/mds_plot.R
    • MDS plot (in .pdf format): [<OUTDIR>]/MDS_PLOT_(Inds/Pop)_[<FILENAME_SUFFIX>].pdf
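The grouping performed when MDSGROUPop is set to Y (one averaged point per FID) can be sketched as follows. This is an illustrative Python version under assumed inputs, not the R script that Zanardi generates; the function name and the two-dimension row layout are hypothetical.

```python
from collections import defaultdict


def average_by_fid(rows):
    """rows: iterable of (fid, c1, c2). Returns {fid: (mean_c1, mean_c2)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for fid, c1, c2 in rows:
        acc = sums[fid]
        acc[0] += c1
        acc[1] += c2
        acc[2] += 1
    return {fid: (s1 / n, s2 / n) for fid, (s1, s2, n) in sums.items()}


rows = [("Jedi", 0.0, 1.0), ("Jedi", 2.0, 3.0), ("Sith", -1.0, -1.0)]
print(average_by_fid(rows))
```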

###--pedigchk option

  • MEANING:

    This option runs a Mendelian inheritance check (i.e. it looks for opposing homozygotes in close relatives) among all genotyped samples, following the pedigree file provided by the user. IMPORTANT NOTE: The total number of SNPs considered and, as a consequence, the final % of Mendelian errors, refer only to those autosomal SNPs that are homozygous in BOTH individuals. For example, suppose individual (1) has 3 SNPs: AA AB 00 (homozygous for the A allele at the first SNP, heterozygous at the second SNP, and missing at the third SNP) and individual (2) has BB BB BB for the same 3 SNPs (homozygous for the B allele at all three). The Mendelian error will be 100%: only the first SNP is considered, and it is an opposing homozygote, whereas the second (one heterozygote) and the third (one missing call) are NOT considered.

  • SOFTWARE USED:

    PLINK v1.9 (Chang et al., 2015); Own code for pedigree check.

  • PARAMETER FILE OPTIONS:

    • PDSKIPCOUPLE: To avoid useless calculations, pairs that have already been checked can be skipped from the pedigree check using this variable. The required format is the same as the PEDIGCHK_pass.txt file (see the OUTPUT FILE(S) PRODUCED section below for more information).
    • PDMEND_THRES: This is a required parameter, with values ranging between 0 and 1, indicating the Mendelian inheritance error rate threshold (e.g. a value of .02 means that all individuals with more than 2% of Mendelian inheritance errors will be considered as failing samples).
    • PDBESTALL: This is a required parameter, and must be “Y” or “N”. “Y” means that the best match for all individuals failing the pedigree check will be searched for in the full genotype file (WARNING: highly time consuming if a large number of samples are genotyped or a large number of individuals fail!). “N” means that the search will be restricted to all individuals of the same sex as the failing parent (e.g. if the sire fails, female individuals will not be checked) and to all individuals born before the target individual itself. Therefore, if you’re not so sure of the accuracy of your pedigree file, use “Y”.
  • INPUT FILE(S) REQUIRED:

    • Any genotype and map input file (INPUT_PED+INPUT_MAP and/or INPUT_705+INPUT_705_MAP)
    • A pedigree file (INPUT_PEDIG)

  • OUTPUT FILE(S) PRODUCED:

    • Samples passing pedigree check: [<OUTDIR>]/PEDIGCHK_pass.txt

      | Field | Content | Example |
      |-------|---------|---------|
      | 1 | ID son (individual) | LukeSkywalker |
      | 2 | ID parent | Anakin |
      | 3 | Type of relationship [SIRE/DAM] | SIRE |
      | 4 | Mend. Errors/Total nonmissing SNPs | 0/1138 |
      | 5 | Mendelian error (%) | 0.00000% |

      Semicolon-separated file, with header. NOTE: this file (or any file with an identical layout) can be used in subsequent runs to skip already checked pairs (PDSKIPCOUPLE variable in the parameter file).

    • Samples failing pedigree check: [<OUTDIR>]/PEDIGCHK_fail.txt. The layout is identical to PEDIGCHK_pass.txt.

    • (if at least 1 failing sample) Best matching (e.g. potential) parents for failing samples: [<OUTDIR>]/PEDIGCHK_bestmatch.txt

      | Field | Content | Example |
      |-------|---------|---------|
      | 1 | Candidate # / Total candidates | 1/1 |
      | 2 | ID son (individual) | LukeSkywalker |
      | 3 | ID POTENTIAL parent | Anakin |
      | 4 | Type of relationship [SIRE/DAM] | SIRE |
      | 5 | Mendelian error (%) | 0.00000% |

      NOTE: If no plausible best match is found, field 3 will be set to “---” and field 5 will be set to “999”.
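The counting rule described under MEANING for --pedigchk (only SNPs that are homozygous and non-missing in BOTH individuals are considered) can be sketched as follows. This is an illustration under assumed inputs, not Zanardi's own pedigree-check code: the function name is hypothetical, genotypes are given as two-character strings with "0" as the missing allele, and the caller is assumed to have already restricted the data to autosomal SNPs.

```python
def mendelian_error_rate(offspring, parent, missing="0"):
    """Return (errors, considered) over SNPs homozygous in BOTH individuals."""
    errors = considered = 0
    for (o1, o2), (p1, p2) in zip(offspring, parent):
        if missing in (o1, o2, p1, p2):
            continue  # a missing call in either individual: SNP is skipped
        if o1 != o2 or p1 != p2:
            continue  # a heterozygote in either individual: SNP is skipped
        considered += 1
        if o1 != p1:
            errors += 1  # opposing homozygotes
    return errors, considered


# The example from the text: AA AB 00 versus BB BB BB.
errors, considered = mendelian_error_rate(["AA", "AB", "00"], ["BB", "BB", "BB"])
print("{}/{} = {:.1%}".format(errors, considered, errors / considered))
```

With the example genotypes only the first SNP is considered and it is an error, which gives the 100% reported in the text.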

###--beagle3 option

  • MEANING:

    This option will run an imputation step using Beagle v.3.

  • SOFTWARE USED:

    FCgene (Roshyara and Scholz, 2014); Beagle v.3 (Browning and Browning, 2007)

  • PARAMETER FILE OPTIONS:

    • BGMEMORY: virtual memory allocated to the process, in MB - default "2000" (i.e. 2 GB)
    • BG3_MISSING: Missing allele coding - default "0"
    • BG_OTHOPT: (OPTIONAL) similarly to PLINK, apply other BEAGLE options using Beagle v.3 syntax (see the Beagle v.3 manual for further info).
  • INPUT FILE(S) REQUIRED:

    Any genotype and map input file (INPUT_PED+INPUT_MAP and/or INPUT_705+INPUT_705_MAP)

  • OUTPUT FILE(S) PRODUCED:

    • Conversion from PLINK to BEAGLEv.3 format: [<TEMPDIR>]/PLINK_beagle.(bgl/_fcgene.log/..)
    • Beagle run output (BEAGLE v.3 OUTPUT FILE FORMAT): [<OUTDIR>]/BEAGLE_OUT.(dose.gz/gprobs.gz/phased.gz/..)
    • Conversion from BEAGLEv.3 to PLINK format: [<OUTDIR>]/BEAGLE_OUT_[<FILENAME_SUFFIX>].(ped/map/log)

###--beagle4 option

  • MEANING:

    This option will run an imputation step using Beagle v.4. IMPORTANT: If you need phase information, convert the Beagle v.4 output file yourself. The PLINK conversion (i.e. from VCF to PLINK PED/MAP) does not retain the phase information from Beagle files.

  • SOFTWARE USED:

    PLINK v1.9 (Chang et al., 2015); Beagle v.4 (Browning and Browning, 2009)

  • PARAMETER FILE OPTIONS:

    • BGMEMORY: virtual memory allocated to the process, in MB - default "2000" (i.e. 2 GB)
    • BG_OTHOPT: (OPTIONAL) similarly to PLINK, apply other BEAGLE options using Beagle v.4 syntax (see the Beagle v.4 manual for further info).
  • INPUT FILE(S) REQUIRED:

    Any genotype and map input file (INPUT_PED+INPUT_MAP and/or INPUT_705+INPUT_705_MAP)

    NOTE: Beagle v.4 allows the use of pedigree information (which speeds things up). This can be enabled through the BG_OTHOPT variable.

  • OUTPUT FILE(S) PRODUCED:

    • Conversion from PLINK to VCF format: [<TMPDIR>]/beagle4_infile.vcf
    • Beagle run output: [<TMPDIR>]/result_beagle4.vcf.gz
    • Conversion from VCF to PLINK format: [<OUTDIR>]/BEAGLE_OUT_[<FILENAME_SUFFIX>].(ped/map/log)

###--fimpute option

  • MEANING:

    This option will run an imputation step using FImpute. IMPORTANT: If you need phase information, convert the FImpute output file yourself. The PLINK conversion (i.e. between FImpute format and PLINK PED/MAP) does not retain the phase information from FImpute files.

  • SOFTWARE USED:

    PLINK v1.9 (Chang et al., 2015); FImpute (Sargolzaei et al., 2014)

  • PARAMETER FILE OPTIONS:

    • FMP_NJOB: Number of jobs to be run in parallel - default "1"
    • FMP_OTHOPT: (OPTIONAL) similarly to PLINK, apply other FImpute options. Any other option except for the inclusion of numbers of jobs and the presence of input files are allowed here (using FImpute syntax separated by “;”. See FImpute manual for further info).
  • INPUT FILE(S) REQUIRED:

    Any genotype and map input file (INPUT_PED+INPUT_MAP and/or INPUT_705+INPUT_705_MAP)

    NOTE: FImpute allows the use of pedigree information (which speeds things up). This cannot be enabled through the FMP_OTHOPT variable. If the INPUT_PEDIG field is filled, the pedigree is loaded automatically.

  • OUTPUT FILE(S) PRODUCED:
    • Reduction of dimension of genotypes (only BTA19): [<TMPDIR>]/PLINK_HAPREP.(ped/map/log)
    • Conversion from PLINK to VCF format: [<TMPDIR>]/beagle4_infile.vcf
    • Beagle run output: [<TMPDIR>]/result_beagle4.vcf.gz
    • Conversion from VCF to PLINK format: [<OUTDIR>]/BEAGLE_OUT[<FILENAME_SUFFIX>].(ped/map/log)
    • Conversion in 12 PLINK format: [<TMPDIR>]/FIMPUTE_recode12.(ped/map/log)
    • Conversion from PLINK to FImpute format: [<TMPDIR>]/genotype_[<FILENAME_SUFFIX>].FM and [<TMPDIR>]/snp_info_[<FILENAME_SUFFIX>].FM
    • Parameter file for FImpute: [<TMPDIR>]/param_FImpute_[<FILENAME_SUFFIX>].FM
    • Allele frequency using PLINK: [<TMPDIR>]/freqACGT.(frq/nosex/log)
    • FImpute run output folder: [<OUTDIR>]/output_FImpute_[<FILENAME_SUFFIX>]
    • Conversion from FImpute to PLINK format: [<OUTDIR>]/FIMPUTE_[<FILENAME_SUFFIX>].(ped/map/log)

###--haprep [BSW/FLK] option

  • MEANING:

    This option is stand-alone (the program exits after running it) and is intended for 2 cattle breeds only: Brown Swiss (BSW) and Fleckvieh (FLK). It prepares the input file for a web service that predicts whether individuals are carriers of a breed-specific haplotype linked to reduced fertility. It works only with the BovineSNP50 v.2 array (it automatically selects BTA19) and requires raw genotypes, as it selects the SNPs used in the training model (~300 for BSW, ~1100 for FLK). An imputation step (on BTA19 only) is run, because the machine learning algorithm used in the downstream analysis does not accept missing genotypes. The output file produced by Zanardi is the input file for the web app that runs the analysis: https://stebif68.shinyapps.io/EzeApp

  • SOFTWARE USED:

    PLINK v1.9 (Chang et al., 2015); Beagle v.4 (Browning and Browning, 2009); Own code

  • PARAMETER FILE OPTIONS:

    N/A

  • INPUT FILE(S) REQUIRED:

    Any genotype and map input file (INPUT_PED+INPUT_MAP and/or INPUT_705+INPUT_705_MAP)

  • OUTPUT FILE(S) PRODUCED:
    • Reduction of dimension of genotypes (only BTA19): [<TMPDIR>]/PLINK_HAPREP.(ped/map/log)
    • Conversion from PLINK to VCF format: [<TMPDIR>]/beagle4_infile.vcf
    • Beagle run output: [<TMPDIR>]/result_beagle4.vcf.gz
    • Conversion from VCF to PLINK format: [<OUTDIR>]/BEAGLE_OUT[<FILENAME_SUFFIX>].(ped/map/log)
    • Reduce number of SNPs required in output: [<TMPDIR>]/HAPLO_small.txt and [<TMPDIR>]/PLINK_HAPREP.(ped/map)
    • Final output: [<OUTDIR>]/HAPREP[<FILENAME_SUFFIX>].txt

###--roh option

  • MEANING:

    This option will search for Runs of Homozygosity (ROH), individual- and chromosome-wise. Unlike PLINK's own ROH detection, this is a variable-length ROH procedure (i.e. it avoids the fixed sliding-window approach). A plot by FID (i.e. the first column in the PLINK PED file, usually used to identify the breed) and by chromosome is produced using R (with the ggplot2 package).

  • SOFTWARE USED:

    PLINK v1.9 (Chang et al., 2015); Own code (Marras et al., 2015)

  • PARAMETER FILE OPTIONS:

    • ROH_SNP: Minimum number of SNPs in each ROH (i.e. a ROH with fewer SNPs than this is discarded).
    • ROH_MAXMIS: Maximum number of missing SNPs per ROH (i.e. the number of missing SNPs tolerated within a ROH).
    • ROH_MAXHET: Maximum number of heterozygous SNPs per ROH (i.e. the number of heterozygous SNPs tolerated within a ROH, usually used to account for genotyping errors).
    • ROH_MINLEN: Minimum length of a ROH, in Mb
  • INPUT FILE(S) REQUIRED:

    Any genotype and map input file (INPUT_PED+INPUT_MAP and/or INPUT_705+INPUT_705_MAP)

  • OUTPUT FILE(S) PRODUCED:

    • Conversion to 1/2 allele format (PLINK v1.9): [<TMPDIR>]/ROH_recode12.(ped/map)

    • ROH reads output: [<OUTDIR>]/ROH_reads_[<FILENAME_SUFFIX>].txt

      | Field | Content | Example |
      |-------|---------|---------|
      | 1 | FID (or breed) | Jedi |
      | 2 | ID individual | LukeSkywalker |
      | 3 | Chromosome | 1 |
      | 4 | Count (# SNPs in ROH) | 500 |
      | 5 | Start (ROH bp start) | 1000 |
      | 6 | End (ROH bp end) | 50000 |
      | 7 | Length (ROH length) | 49000 |

    • ROH R script (PLAIN TEXT): [<TMPDIR>]/roh_plot.R

    • ROH plot: [<OUTDIR>]/ROH_plot_[<FILENAME_SUFFIX>].pdf
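To show how the four ROH parameters interact, here is a minimal sketch of a consecutive-run scan over one chromosome of one individual. It is a simplified illustration, not the procedure implemented in Zanardi (Marras et al., 2015): the function name, the greedy restart after a broken run, and the "hom"/"het"/"mis" genotype coding are all assumptions made for the example.

```python
def find_roh(positions, genotypes, min_snp, max_mis, max_het, min_len_mb):
    """Return (count, start_bp, end_bp, length_bp) tuples for one chromosome.

    positions: SNP positions in bp, sorted; genotypes: "hom", "het" or "mis".
    """
    runs = []
    start = last_hom = None  # indexes of the first / last homozygous SNP
    n_het = n_mis = 0

    def close_run():
        if start is None:
            return
        count = last_hom - start + 1
        length = positions[last_hom] - positions[start]
        # Keep the run only if it satisfies ROH_SNP and ROH_MINLEN.
        if count >= min_snp and length >= min_len_mb * 1000000:
            runs.append((count, positions[start], positions[last_hom], length))

    for i, genotype in enumerate(genotypes):
        if genotype == "hom":
            if start is None:
                start, n_het, n_mis = i, 0, 0
            last_hom = i
            continue
        if start is None:
            continue  # a run can only begin at a homozygous SNP
        if genotype == "het":
            n_het += 1
        else:
            n_mis += 1
        # Exceeding ROH_MAXHET or ROH_MAXMIS ends the current run.
        if n_het > max_het or n_mis > max_mis:
            close_run()
            start = None
    close_run()
    return runs


positions = [i * 100000 for i in range(30)]
genotypes = ["hom"] * 12 + ["het"] + ["hom"] * 10 + ["het", "het"] + ["hom"] * 5
print(find_roh(positions, genotypes, min_snp=10, max_mis=0, max_het=1, min_len_mb=1))
```

In this example the single tolerated heterozygote is absorbed into one run of 23 SNPs, while the final stretch of 5 homozygous SNPs is dropped because it is shorter than the minimum number of SNPs.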

###--froh option

  • MEANING:

    This option will search for Runs of Homozygosity, individual- and chromosome-wise, with the objective of obtaining ROH-based inbreeding coefficients. A file including total and chromosome-wise inbreeding coefficients is provided (by individual), along with all output files from the --roh option.

  • SOFTWARE USED:

    PLINK v1.9 (Chang et al., 2015); Own code (Marras et al., 2015)

  • PARAMETER FILE OPTIONS:

    • ROH_SNP: Minimum number of SNPs in each ROH (i.e. a ROH with fewer SNPs than this is discarded).
    • ROH_MAXMIS: Maximum number of missing SNPs per ROH (i.e. the number of missing SNPs tolerated within a ROH).
    • ROH_MAXHET: Maximum number of heterozygous SNPs per ROH (i.e. the number of heterozygous SNPs tolerated within a ROH, usually used to account for genotyping errors).
    • ROH_MINLEN: Minimum length of a ROH, in Mb
  • INPUT FILE(S) REQUIRED:

    Any genotype and map input file (INPUT_PED+INPUT_MAP and/or INPUT_705+INPUT_705_MAP)

  • OUTPUT FILE(S) PRODUCED:

    • Conversion to 1/2 allele format (PLINK v1.9): [<TMPDIR>]/ROH_recode12.(ped/map)

    • ROH-based inbreeding coefficients: [<OUTDIR>]/ROH_inbreeding_[<FILENAME_SUFFIX>].txt

      | Field | Content | Example |
      |-------|---------|---------|
      | 1 | FID (or breed) | Jedi |
      | 2 | ID individual | LukeSkywalker |
      | 3 | Total inbreeding coeff. (FROH) | 0.0001 |
      | 4-33 | Chrom-wise FROH (1-29) | 0.1; ... ; 0.1 |
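As a rough illustration of a ROH-based inbreeding coefficient, the sketch below uses the commonly cited definition: total length covered by ROH divided by the length of the genome (or chromosome) considered. That this is exactly what Zanardi computes is an assumption, since the formula is not given on this page; the function name and the chromosome lengths in the example are hypothetical.

```python
def froh_by_chromosome(roh_rows, chrom_lengths):
    """roh_rows: (chromosome, length_bp); chrom_lengths: {chromosome: length_bp}.

    Returns (total_froh, {chromosome: froh}).
    """
    covered = {chrom: 0 for chrom in chrom_lengths}
    for chrom, length in roh_rows:
        covered[chrom] += length
    per_chrom = {c: covered[c] / chrom_lengths[c] for c in chrom_lengths}
    total = sum(covered.values()) / sum(chrom_lengths.values())
    return total, per_chrom


roh_rows = [(1, 10000000), (1, 5000000), (2, 20000000)]
chrom_lengths = {1: 150000000, 2: 100000000}  # hypothetical lengths in bp
print(froh_by_chromosome(roh_rows, chrom_lengths))
```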

###--admixture option

  • SOFTWARE USED:

    PLINK v1.9 (Chang et al., 2015); Admixture (Alexander and Lange, 2011); R (with ggplot2 package installed)

  • PARAMETER FILE OPTIONS:

    • ROH_SNP: Minimum number of SNP for each ROH (e.g. if a ROH has less than this number is not accounted for).
    • ADM_KVALUE: The maximum K to be run. K is, in (extremely) simple words, the number of populations Admixture assumes when clustering individuals. Zanardi will run all K’s from 2 to the desired K. Broadly, if the number of populations to be analysed is known, choose K=known_pops+3. To choose the most suitable K for your dataset, look for the lowest CV value in the CV plot (created automatically by Zanardi).
    • ADM_CORE: number of processors used in the calculation (the higher, the lower the processing time).
    • ADM_CV: number of cross-validations (general rule: the higher, the better; however, processing time increases proportionally with this value).
  • INPUT FILE(S) REQUIRED:

    Any genotype and map input file (INPUT_PED+INPUT_MAP and/or INPUT_705+INPUT_705_MAP)

  • OUTPUT FILE(S) PRODUCED:

    • Conversion to 1/2 allele format (PLINK v1.9): [<TMPDIR>]/Admixture_[<FILENAME_SUFFIX>]_K.(ped/map)
    • ADMIXTURE run (for i=2,K): [<OUTDIR>]/Admixture_[<FILENAME_SUFFIX>]_K.[i].(Q/P)
    • ADMIXTURE CV plot: [<OUTDIR>]/Admixture_CVplot_[<FILENAME_SUFFIX>].pdf
    • ADMIXTURE BAR plot (one file, multiple pages, one for each K): [<OUTDIR>]/Admixture_BARplot_[<FILENAME_SUFFIX>].pdf
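The rule of thumb given for ADM_KVALUE (run every K from 2 up to the chosen maximum, then prefer the K with the lowest CV value) can be written as a short sketch. The function names and the CV values below are hypothetical and only illustrate the selection step; in practice the CV values come from the ADMIXTURE run summarised in the CV plot.

```python
def k_values(max_k):
    """All K's that would be run for a given maximum K (from 2 to max_k)."""
    return list(range(2, max_k + 1))


def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


cv_errors = {2: 0.52, 3: 0.48, 4: 0.47, 5: 0.49}  # hypothetical CV values
print(k_values(5), best_k(cv_errors))
```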

###--gsprep option

  • MEANING:

    This option prepares the input files for the GBCPP pipeline (Meuwissen et al., unpublished).

  • SOFTWARE USED:

    PLINK v1.9 (Chang et al., 2015)

  • PARAMETER FILE OPTIONS:

    N/A

  • INPUT FILE(S) REQUIRED:

    Any genotype and map input file (INPUT_PED+INPUT_MAP and/or INPUT_705+INPUT_705_MAP)

  • OUTPUT FILE(S) PRODUCED:

    • Conversion to 1/2 allele format (PLINK v1.9): [<TMPDIR>]/GSPREP_recode12.(ped/map/log)
    • Conversion to genotype GBCPP pipeline format: [<OUTDIR>]/GSPREP_[<FILENAME_SUFFIX>].geno
    • Conversion to pedigree GBCPP pipeline format: [<OUTDIR>]/GSPREP_[<FILENAME_SUFFIX>].pedig
    • Conversion to phenotype GBCPP pipeline format: [<OUTDIR>]/GSPREP_[<FILENAME_SUFFIX>].pheno

###--optiprep option

  • MEANING:

    This option prepares the input files for the OPTIMATE software (Varona et al., unpublished).

  • SOFTWARE USED:

    PLINK v1.9 (Chang et al., 2015)

  • PARAMETER FILE OPTIONS:

    N/A

  • INPUT FILE(S) REQUIRED:

    Any genotype and map input file (INPUT_PED+INPUT_MAP and/or INPUT_705+INPUT_705_MAP)

  • OUTPUT FILE(S) PRODUCED:

    • Conversion of OPTIPREP map format: [<OUTDIR>]/OPTIMATE_[<FILENAME_SUFFIX>].map
    • Conversion of OPTIPREP pedigree format: [<OUTDIR>]/OPTIMATE_[<FILENAME_SUFFIX>].pedig
    • Conversion of OPTIPREP phenotype format: [<OUTDIR>]/OPTIMATE_[<FILENAME_SUFFIX>].pheno
    • SIRE ped file and trace in (renumbered) pedigree: [<OUTDIR>]/OPTIMATE_SIRE_[<FILENAME_SUFFIX>].(ped/gen)
    • DAM ped file and trace in (renumbered) pedigree: [<OUTDIR>]/OPTIMATE_DAM_[<FILENAME_SUFFIX>].(ped/gen)

###--save

This option prevents Zanardi from deleting the temporary folder at each step. By default Zanardi discards all temporary files to reduce disk space usage (as it is designed to handle large files!)

###--outdir=[new_path]

This option changes the default name/location of the OUTPUT directory (default: ./OUTPUT)

###--tmpdir=[new_path]

This option changes the default name/location of the TEMP directory (default: ./TEMP)

###--debug

This option is generally for internal use. It is a helpful option for debugging. Normal users should not use this option (unless they like their screens flooded with output!)

###-q or --quiet

This option is for minimalist users. If this option is used, no log will be printed on screen. In any case, the log of the latest run will be saved, as usual, in the Zanardi.log file.
