Skip to content
Jade Cheng edited this page Sep 16, 2018 · 1 revision

Input, FASTA alignment

We need to have FASTA files. The sequences need to match in length and names. Let's use the Isolation model as an example here.

$ ls
a.fasta  b.fasta

$ cat a.fasta | wc -c 
1010013

$ cat b.fasta | wc -c 
1010013

$ head *.fasta -n 3
==> a.fasta <==
> input_1_1
aaaaaaaaaaaaaaaaaXaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaAaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

==> b.fasta <==
> input_1_1
aaaaaaaaaaaaaaaaaaaaaaaaa-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaCaaGaaaaaa

Preprocessing, ZipHMM data

We need to prepare the ZipHMM data directory according to a model specification.

$ Jocx.py init . iso a.fasta b.fasta 
# creating uncompressed sequence file
# using output directory "./ziphmm_iso_a_b"
# parsing "a.fasta"
# parsing "b.fasta"
# comparing sequence "input_1_1"
# sequence length: 1000000
# creating "./ziphmm_iso_a_b/input_1_1.ziphmm"
# Creating 5-state alignment in directory: ./ziphmm_iso_a_b/input_1_1.ziphmm

$ ls
a.fasta  b.fasta  ziphmm_iso_a_b

$ find ziphmm_iso_a_b/
ziphmm_iso_a_b/
ziphmm_iso_a_b/input_1_1.ziphmm
ziphmm_iso_a_b/input_1_1.ziphmm/data_structure
ziphmm_iso_a_b/input_1_1.ziphmm/nStates2seq
ziphmm_iso_a_b/input_1_1.ziphmm/nStates2seq/5.seq
ziphmm_iso_a_b/input_1_1.ziphmm/original_sequence

Model execution, parameter inference

We can now execute CoalHMM given the matching model and one of the three optimizers, NM, PSO, and GA.

$ Jocx.py run . iso nm 0.001 2000 0.5
# algorithm            = _NMOptimiser
# timeout              = None
# max_executions       = 1
#
# 2016-11-20 22:08:07.350329
#
# param0 = (0.0005000000000000001, 0.002)
# param1 = (999.9999999999998, 4000.0)
# param2 = (0.25, 1.0)
#
# execution	state	score	param0	param1	param2
0	init	-34547.3651534	0.001095650748	1589.08255478	0.283742230795
1	fmin-in	-34645.0430766	0.000712774345468	1029.01830743	0.813449518812
1	fmin-cb	-34589.6802608	0.000748413062741	926.116476683	0.854121994752
1	fmin-cb	-34589.6802608	0.000748413062741	926.116476683	0.854121994752
1	fmin-cb	-34533.1270287	0.000831570069713	857.515256188	0.840564502772
:

$ Jocx.py run . iso pso 0.001 2000 0.5
# algorithm            = _PSOptimiser
# timeout              = None
# max_iterations       = 50
# particle_count       = 50
# max_initial_velocity = 0.02
# omega                = 0.9
# phi_particle         = 0.3
# phi_swarm            = 0.1
#
# 2016-11-20 22:20:20.968101
#
# param0 = (0.0005000000000000001, 0.002)
# param1 = (999.9999999999998, 4000.0)
# param2 = (0.25, 1.0)
#
#
# PARTICLES FOR ITERATION 1
# swarm_fitness           = -34370.3728435
# best_average_fitness    = -34865.0449053
# best_minimum_fitness    = -35829.4334913
# best_maximum_fitness    = -34370.3728435
# current_average_fitness = -34865.0449053
# current_minimum_fitness = -35829.4334913
# current_maximum_fitness = -34370.3728435
#
# gen idv          fitness      param0         param1      param2     best_fitness  best-param0    best-param1  best-param2
    1   0  -34370.37284352  0.00181842  1182.56549665  0.31842394  -34370.37284352   0.00181842  1182.56549665   0.31842394
    1   1  -34372.85439833  0.00184368  1172.75314657  0.64961210  -34372.85439833   0.00184368  1172.75314657   0.64961210
    1   2  -34373.94788272  0.00195363  1070.88584882  0.65517884  -34373.94788272   0.00195363  1070.88584882   0.65517884
    1   3  -34393.65611850  0.00157531  1165.64018865  0.36541651  -34393.65611850   0.00157531  1165.64018865   0.36541651
    1   4  -34399.70364572  0.00154114  1083.40096385  0.71295224  -34399.70364572   0.00154114  1083.40096385   0.71295224
: 

Supported models

Here is the output of the '--help' option. It explains all of the supported models and the required order of the FASTA files supplied for each model.

USAGE:
  python Jocx.py init [options] <exp-folder> <model> <fasta1> <fasta2> ...
  python Jocx.py run <exp-folder> <model> <optimizer> <range1> <range2> ...

COMMANDS
  init            Initializes alignments and zipped sequences needed to run an
                  experiment. This command must be executed once before using
                  the 'run' command.
  run             Runs an experiment. Before running an experiment, the
                  appropriate directories must be initialized using the 'init'
                  command.

OPTIONS
  --chunk-size    Indicates the next argument is the maximum size of an
                  individual alignment produced by the ZipHMM library; units
                  are in symbols. If unspecified, the chunk size is set to the
                  size of the entire sequence.
  --help          Prints this help message and exits

ARGUMENTS
  exp-folder      The path to the experiments folder
  fasta1          The path to the first FASTA-formatted file
  fasta2          The path to the second FASTA-formatted file
  model           The model to use for the experiment; e.g. 'iso' or 'iim'
  optimizer       The optimizer to use for the experiment; e.g. 'nm' or 'pso'
  range1          The range for the first parameter (see README for details)
  range2          The range for the second parameter

DESCRIPTION
  This program executes CoalHMM with a specified model and optimizer using a
  given sequence of data in the format of ZipHMM directories.  The program
  prints to standard output the progression of the estimated parameters and
  the corresponding log likelihood.

  ISOLATION MODEL (iso)
      *
     / \ tau
    A   B
    3 params -> tau, coal_rate, recomb_rate
    2 seqs   -> A, B
    1 group  -> AB

  ISOLATION INITIAL MIGRATION MODEL (iim)
        *
       / \
      /<->\    mig_time
     /     \   iso_time
    A       B
    5 params -> iso_time, mig_time, coal_rate, recomb_rate, mig_rate
    2 seqs   -> A, B
    1 group  -> AB

  ISOLATION 3 HMM MODEL (iso-3hmm)
      *
     / \ tau
    A1  B1
    A2  B2
    3 params -> tau, coal_rate, recomb_rate
    4 seqs   -> A1, A2, B1, B2
    3 groups -> A1B1, A1A2, B1B2

  ISOLATION INITIAL MIGRATION 3 HMM MODEL (iim-3hmm)
        *
       / \
      /<->\   mig_time
     /     \  iso_time
    A1      B1
    A2      B2
    5 params -> iso_time, mig_time, coal_rate, recomb_rate, mig_rate
    4 seqs   -> A1, A2, B1, B2
    3 groups -> A1B1, A1A2, B1B2

  THREE POP ADMIX MODEL (admix)
                   *
                  / \
                 /   \    mig_time
                /_____\
    admix_prop / <-|   \  iso_time
              A    C    B
    5 params -> iso_time, mig_time, coal_rate, recomb_rate, admix_prop
    3 seqs   -> A, B, C
    3 groups -> AC, BC, AB

  THREE POP ADMIX 2 3 MODEL (admix23)
                      *
                     / \     greedy1_time_1a
    buddy23_time_1a /\  \
                   /  \_/\   buddy23_time_2a
       admix_prop /  <-|  \  iso_time
                 A    C    B
    7 params -> iso_time, buddy23_time_1a, buddy23_time_2a, greedy1_time_1a,
                coal_rate, recomb_rate, admix_prop
    3 seqs   -> A, B, C
    3 groups -> AC, BC, AB

  ONE POP ADMIX 2 3 MODEL (admix23-1pop)
                      *
                     / \     greedy1_time_1a
    buddy23_time_1a /\  \
                   /  \_/\   buddy23_time_2a
       admix_prop /  <-|  \  iso_time
                      C1
                      C2
    5 params -> iso_time, mig_time, coal_rate, recomb_rate, admix_prop
    2 seqs   -> C1, C2
    1 groups -> C1C2

  TWO POP ADMIX 2 3 MODEL (admix23-2pop)
                      *
                     / \     greedy1_time_1a
    buddy23_time_1a /\  \
                   /  \_/\   buddy23_time_2a
       admix_prop /  <-|  \  iso_time
                 A1   C1
                 A2   C2
    6 params -> iso_time, buddy23_time, greedy1_time, coal_rate, recomb_rate,
                admix_prop
    4 seqs   -> A1, A2, C1, C2
    3 groups -> A1C1, A1A2, C1C2

  TWO POP ADMIX 2 3 ONE SAMPLE MODEL (admix23-2pop-1sample)
                      *
                     / \     greedy1_time_1a
    buddy23_time_1a /\  \
                   /  \_/\   buddy23_time_2a
       admix_prop /  <-|  \  iso_time
                 A    C
    6 params -> iso_time, buddy23_time, greedy1_time, coal_rate, recomb_rate,
                admix_prop
    2 seqs   -> A, C
    1 group  -> AC

  THREE POP ADMIX 2 3 MODEL 6 HMM (admix23-6hmm)
                      *
                     / \     greedy1_time_1a
    buddy23_time_1a /\  \
                   /  \_/\   buddy23_time_2a
       admix_prop /  <-|  \  iso_time
                 A1   C1   B1
                 A2   C2   B2
    7 params -> iso_time, buddy23_time_1a, buddy23_time_2a, greedy1_time_1a,
                coal_rate, recomb_rate, admix_prop
    6 seqs   -> A1, A2, B1, B2, C1, C2
    6 groups -> A1C1, B1C1, A1B1, A1A2, B1B2, C1C2

  THREE POP ADMIX 2 3 MODEL 15 HMM (admix23-15hmm)
                      *
                     / \     greedy1_time_1a
    buddy23_time_1a /\  \
                   /  \_/\   buddy23_time_2a
       admix_prop /  <-|  \  iso_time
                 A1   C1   B1
                 A2   C2   B2
    7 params -> iso_time, buddy23_time_1a, buddy23_time_2a, greedy1_time_1a,
                coal_rate, recomb_rate, admix_prop
    6 seqs   -> A1, A2, B1, B2, C1, C2
    15 groups-> A1C1, A1C2, A2C1, A2C2, B1C1
                B1C2, B2C1, B2C2, A1B1, A1B2
                A2B1, A2B2, A1A2, B1B2, C1C2

EXAMPLES
  $ python Jocx.py init .../ziphmm iso .../seq1.fa .../seq2.fa
  $ python Jocx.py run  .../ziphmm iso ga
  # algorithm            = _GAOptimiser
  # timeout              = None
  # elite_count          = 1
  :
  # gen idv            fitness          param0          param1          param2
      1   1    -38786.09710120      0.00083995    164.91715870      0.01344298
      1   2    -59912.38796110      0.00098375     31.17524014      0.36805062
  :

BUGS
  Report all bugs to Jade Cheng <ycheng@cs.au.dk>

Copyright (c) 2015-2018 Jade Cheng