For the filtered dataset (sample filters on, genotype filters off):

- generate amplicon (gene) trees for pseudohaploid samples
- generate a species tree by reconciling the gene trees

Manual steps: none

In [1]:
%run common.ipynb

## Install and test ASTRAL

In [2]:
# astral not packaged in conda, install from github
# output jar
astral = '../../../bin/Astral/astral.5.6.3.jar'
# install
if not os.path.isfile(astral):
    ! wget https://github.com/smirarab/ASTRAL/raw/master/Astral.5.6.3.zip -P {BIN_DIR}
    ! unzip {os.path.join(BIN_DIR, 'Astral.5.6.3.zip')} -d {BIN_DIR}
    ! rm {os.path.join(BIN_DIR, '*.zip*')}

In [3]:
# test installation
! java -jar {astral} -i {os.path.dirname(astral)}/test_data/song_primates.424.gene.tre



This is ASTRAL version 5.6.3
Gene trees are treated as unrooted
424 trees read from ../../../bin/Astral/test_data/song_primates.424.gene.tre
All output trees will be *arbitrarily* rooted at Marmoset

Number of taxa: 14 (14 species)
Taxa: [Marmoset, Orangutan, Human, Chimpanzee, Gorilla, Macaque, Galago, Mouse_Lemur, Tree_Shrew, Rat, Tarsier, Rabbit, Horse, Sloth]
Taxon occupancy: {Human=424, Rat=424, Tarsier=424, Galago=424, Rabbit=424, Macaque=424, Sloth=424, Marmoset=424, Tree_Shrew=424, Chimpanzee=424, Mouse_Lemur=424, Horse=424, Orangutan=424, Gorilla=424}
Number of gene trees: 424
0 trees have missing taxa
Calculating quartet distance matrix (for completion of X)
Species tree distances calculated ...
Building set of clusters (X) from gene trees 
------------------------------
gradient0: 339
Number of Clusters after addition by distance: 339
calculating extra bipartitions to be added at level 1 ...
Adding to X using resolutions of greedy consensus ...
Limit for sigma of degrees:4

## Import and filter sequencing data

In [5]:
seq_data = pd.read_csv(CLUSTERING, dtype={'target':str})
seq_data.columns

Index(['s_Sample', 'target', 'consensus', 'reads', 'species', 'combUID',
       'ag1k_cluster', 'multisp_cluster', 'outlier_genotype', 'split_alleles'],
      dtype='object')

In [6]:
# filter sequencing data: sample filter on (drop unknown species),
# genotype filters off (left commented out below)
hq_seq_data = seq_data[(seq_data.species != 'unknown')] # &
#                        (~seq_data.outlier_genotype) & 
#                        (~seq_data.split_alleles)]
display(seq_data.shape, hq_seq_data.shape)

(10057, 10)

(9589, 10)
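The filter cell above switches the genotype filters off by commenting them out. A minimal sketch of the same logic with explicit switches follows. The helper name `filter_seq_data` and its flags are hypothetical, and the sketch assumes that `outlier_genotype` and `split_alleles` hold booleans, as the `~` in the commented-out lines suggests.

```python
import pandas as pd

def filter_seq_data(df, sample_filters=True, genotype_filters=False):
    """Hypothetical helper: return rows passing the selected filters."""
    keep = pd.Series(True, index=df.index)
    if sample_filters:
        # sample filter: species must be known
        keep &= (df['species'] != 'unknown')
    if genotype_filters:
        # assumption: both columns are boolean flags
        keep &= ~df['outlier_genotype'].astype(bool)
        keep &= ~df['split_alleles'].astype(bool)
    return df[keep]

# toy data, not taken from the real dataset
toy = pd.DataFrame({
    'species': ['A_x', 'unknown', 'A_y'],
    'outlier_genotype': [False, False, True],
    'split_alleles': [False, False, False],
})
sample_only = filter_seq_data(toy)                    # this notebook's setting
both = filter_seq_data(toy, genotype_filters=True)    # stricter alternative
```

With the defaults the helper reproduces the setting used in this notebook: rows with unknown species are dropped and genotype flags are ignored.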

## Make pseudohaploid samples

Generate two pseudohaploid sample IDs for each diploid sample. For homozygous samples, use the same sequence for both alleles. For samples with two or more allelic sequences, use the first two in the order they appear in `hq_seq_data`.


In [7]:
def get_first_seq(df):
    return df.consensus.iloc[0]
first_seq = hq_seq_data.groupby(by=['s_Sample','target']).apply(get_first_seq).reset_index()
first_seq['species'] = hq_seq_data.groupby(by=['s_Sample','target'])['species'].max().reset_index(drop=True)
first_seq['hap_Sample'] = first_seq.s_Sample + '-1'
first_seq.head()

Unnamed: 0,s_Sample,target,0,species,hap_Sample
0,Abel-SP24,1,AGCCCGGCACTCGGAGGAGGTTCGCTGACTGCTCTGCCGGTCATTG...,Anopheles_bellator,Abel-SP24-1
1,Abel-SP24,15,AAGCGATCGCGAATCGGGCCGGCCGTGCGCATCCTGGGCATGATCG...,Anopheles_bellator,Abel-SP24-1
2,Abel-SP24,22,AAACGCGCAGCGCACTGCAGAATGAACGCAATCCGGGCAGCAGCTG...,Anopheles_bellator,Abel-SP24-1
3,Abel-SP24,24,AATGTAGCCCCGAGCGCGAAGCCGTCAAGCTGCTGCGCGCCACGTC...,Anopheles_bellator,Abel-SP24-1
4,Abel-SP24,26,CATAAAATGTTGCCGGTGGCTGGTGCCGGCGCACGGTACCCGGGAA...,Anopheles_bellator,Abel-SP24-1


In [8]:
def get_second_seq(df):
    if df.shape[0] > 1:
        return df.consensus.iloc[1]
    else:
        return df.consensus.iloc[0]
second_seq = hq_seq_data.groupby(by=['s_Sample','target']).apply(get_second_seq).reset_index()
second_seq['species'] = hq_seq_data.groupby(by=['s_Sample','target'])['species'].max().reset_index(drop=True)
second_seq['hap_Sample'] = second_seq.s_Sample + '-2'
second_seq.head()

Unnamed: 0,s_Sample,target,0,species,hap_Sample
0,Abel-SP24,1,AGCCCGGCACTCGGAGGAGGTTCGCTGACTGCTCTGCCGGTCATTG...,Anopheles_bellator,Abel-SP24-2
1,Abel-SP24,15,AAGCGATCGCGAATCGGTCCGGCCGTGCGCATCCTGGGCATGATCG...,Anopheles_bellator,Abel-SP24-2
2,Abel-SP24,22,AAACGCGCAGCGCACTGCAGAATGAACGCAATCCGGGCAGCAGCTG...,Anopheles_bellator,Abel-SP24-2
3,Abel-SP24,24,AATGTAGCCCCGAGCGCGAAGCCGTCAAGCTGCTGCGCGCCACGTC...,Anopheles_bellator,Abel-SP24-2
4,Abel-SP24,26,CATAAAATGTTGCCGGTGGCTGGTGCCGGCGCACGGTACCCGGGAA...,Anopheles_bellator,Abel-SP24-2


In [9]:
display(second_seq.shape)
comb_seq = pd.concat([first_seq, second_seq])
comb_seq.shape

(7224, 5)

(14448, 5)
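To make the allele selection rule concrete, here is a small sketch on toy data (the sample names and sequences are invented). It follows the same rule as the cells above: allele 1 is the first sequence, and allele 2 is the second sequence when one exists, otherwise the first again. The helper `pick_allele` is hypothetical and not part of the notebook.

```python
import pandas as pd

# toy input: S1 has three allelic sequences at target '1', S2 is homozygous
toy = pd.DataFrame({
    's_Sample':  ['S1', 'S1', 'S1', 'S2'],
    'target':    ['1', '1', '1', '1'],
    'consensus': ['AAA', 'AAC', 'AAG', 'TTT'],
})

def pick_allele(group, idx):
    """Return allele `idx` (0-based), falling back to the first one."""
    seqs = group['consensus']
    return seqs.iloc[idx] if len(seqs) > idx else seqs.iloc[0]

rows = []
for (sample, target), group in toy.groupby(['s_Sample', 'target']):
    for idx in (0, 1):
        rows.append({
            'hap_Sample': '{}-{}'.format(sample, idx + 1),
            'target': target,
            'consensus': pick_allele(group, idx),
        })
haploid = pd.DataFrame(rows)
```

In this toy case the third sequence of `S1` (`AAG`) is silently dropped, and `S2-1` and `S2-2` both carry `TTT`.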

## Species-sample mapping

In [9]:
# sequence-species mapping file
with open(MAPPING, 'w') as o:
    for _, r in comb_seq.groupby('species')['hap_Sample'] \
                      .unique() \
                      .reset_index() \
                      .iterrows():
        # access by column name; positional r[0]/r[1] is deprecated in recent pandas
        o.write('{}:{}\n'.format(r['species'], ','.join(r['hap_Sample'])))
! head -1 {MAPPING}

Anopheles_aconitus:VBS00053-1,VBS00055-1,VBS00053-2,VBS00055-2
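The mapping file written above has one line per species in the form `species:individual_1,individual_2,...`. The sketch below builds such lines from plain `(species, individual)` pairs and also checks that no individual is assigned to two species. The function name `mapping_lines` is hypothetical, and the check is an addition for illustration, not something the notebook does.

```python
from collections import OrderedDict

def mapping_lines(pairs):
    """Build 'species:ind1,ind2' lines from (species, individual) pairs."""
    groups = OrderedDict()
    owner = {}
    for species, ind in pairs:
        # each individual may belong to one species only
        if owner.setdefault(ind, species) != species:
            raise ValueError('{} assigned to two species'.format(ind))
        members = groups.setdefault(species, [])
        if ind not in members:
            members.append(ind)
    return ['{}:{}'.format(sp, ','.join(inds)) for sp, inds in groups.items()]

lines = mapping_lines([
    ('Anopheles_aconitus', 'VBS00053-1'),
    ('Anopheles_aconitus', 'VBS00055-1'),
    ('Anopheles_aconitus', 'VBS00053-2'),
    ('Anopheles_aconitus', 'VBS00055-2'),
])
```

For these four pairs the result is the single line shown in the cell output above.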


## Alignment

In [10]:
# sequence export and alignment
for ampl in AMPLS:
    sys.stdout.write('\rProcessing {}'.format(ampl))
    # subset amplicon data
    ampl_data = comb_seq[comb_seq.target == ampl]
    with open('temp.fa', 'w') as o:
        for (i, row) in ampl_data.iterrows():
            o.write('>{}\n{}\n'.format(row.hap_Sample,
                                       row[0]))
    ! mafft temp.fa > {ALN_SP_TREE.format(ampl)} 2> /dev/null
    ! rm temp.fa
sys.stdout.write('\nDone!\n')

Processing 61
Done!
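The loop above writes every amplicon to the same `temp.fa` in the working directory, which could collide if two runs share that directory. As a sketch of one alternative, the FASTA file can be written into a temporary directory that is removed automatically. The helper `write_fasta` and the file name `ampl.fa` are hypothetical, and the aligner call itself is left out.

```python
import os
import tempfile

def write_fasta(records, path):
    """Write (name, sequence) pairs to `path` in FASTA format."""
    with open(path, 'w') as out:
        for name, seq in records:
            out.write('>{}\n{}\n'.format(name, seq))

# toy records, not real amplicon sequences
records = [('Abel-SP24-1', 'AGCC'), ('Abel-SP24-2', 'AGCT')]

with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, 'ampl.fa')
    write_fasta(records, fasta_path)
    # the aligner would be run on fasta_path at this point
    with open(fasta_path) as fh:
        content = fh.read()
# the temporary directory and its contents are gone here
```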


## Gene trees

In [11]:
# write all trees into single file
! rm -f {GENE_TREES}
for ampl in AMPLS:
    sys.stdout.write('\rProcessing {}'.format(ampl))
    ! FastTree -nt {ALN_SP_TREE.format(ampl)} >> {GENE_TREES} 2> /dev/null
! wc -l {GENE_TREES}

Processing 61      62 data/5_gene_trees.nwk
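Because stderr is discarded in the loop above, a failed tree inference would only show up as a missing or malformed line in the combined file. A minimal sanity check, assuming one Newick tree per line as the `wc -l` count suggests, could look like this. The function name `count_newick_trees` is hypothetical.

```python
def count_newick_trees(lines):
    """Count non-empty lines that look like complete Newick trees."""
    n_trees = 0
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        if not line.endswith(';'):
            raise ValueError('line {}: missing terminating ";"'.format(lineno))
        if line.count('(') != line.count(')'):
            raise ValueError('line {}: unbalanced parentheses'.format(lineno))
        n_trees += 1
    return n_trees

# toy input; in the notebook the lines would come from open(GENE_TREES)
toy_lines = ['((a,b),c);\n', '\n', '(a,(b,c));\n']
n = count_newick_trees(toy_lines)
```

Comparing the returned count with `len(AMPLS)` would confirm that every amplicon produced a tree.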


## Species tree

In [12]:
! java -jar {astral} -i {GENE_TREES} -o {SPECIES_TREE} -a {MAPPING}



This is ASTRAL version 5.6.3
Gene trees are treated as unrooted
62 trees read from data/5_gene_trees.nwk
All output trees will be *arbitrarily* rooted at Anopheles_atroparvus

Number of taxa: 306 (56 species)
Taxa: {Anopheles_sinensis=[anopheles-sinensis-sinensisscaffoldsasins2-1, anopheles-sinensis-sinensisscaffoldsasins2-2, anopheles-sinensis-chinascaffoldsasinc2-1, anopheles-sinensis-chinascaffoldsasinc2-2], Anopheles_oryzalimnetes=[Aory-SP141-2, Aory-SP141-1], Anopheles_vagus=[VBS00158-1, VBS00158-2, VBS00157-2, VBS00157-1, VBS00156-2, VBS00156-1, VBS00154-1, VBS00154-2], Anopheles_farauti=[anopheles-farauti-far1scaffoldsafarf2-1, anopheles-farauti-far1scaffoldsafarf2-2], Anopheles_maculatus_A=[VBS00108-2, VBS00108-1, VBS00107-2, VBS00107-1], Anopheles_maculatus_B=[anopheles-maculatus-maculatus3scaffoldsamacm1-1, anopheles-maculatus-maculatus3scaffoldsamacm1-2], Anopheles_tenebrosus=[Aten-185-2, Aten-185-1, Aten-79-1, Aten-79-2, Aten-954-1, Aten-954-2, Aten-191-2, Aten-191-1, Ate

Species tree distances calculated ...
Will attempt to complete bipartitions from X before adding using a distance matrix.
Building set of clusters (X) from gene trees 
In second round sampling 4 rounds will be done
------------------------------
gradient0: 2275
------------------------------
gradient1: 48
------------------------------
gradient2: 14
------------------------------
gradient3: 18
Number of Clusters after addition by distance: 2355
calculating extra bipartitions to be added at level 1 ...
Adding to X using resolutions of greedy consensus ...
Limit for sigma of degrees:1450
polytomy size limit : 13
discarded polytomies:  [3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 13]
Threshold 0.0:
Threshold 0.01:
Threshold 0.02:
Threshold 0.05:
Threshold 0.1:
polytomy of size 3; rounds with additions with at least 5 support: 0; clusters: 2355
polytomy of size 4; rounds with additions with at least 5 support: 0; clusters: 2355
Threshold 0.2:
polytomy of size 4; rounds with additions with at least 5 

You may want to ignore posterior probabilities and other statistics related to the following branch branch because the effective number of genes impacting it is only 14:
	{Anopheles_bellator}
You may want to ignore posterior probabilities and other statistics related to the following branch branch because the effective number of genes impacting it is only 14:
	{Anopheles_cruzii}
You may want to ignore posterior probabilities and other statistics related to the following branch branch because the effective number of genes impacting it is only 13:
	{Anopheles_bellator, Anopheles_cruzii}
You may want to ignore posterior probabilities and other statistics related to the following branch branch because the effective number of genes impacting it is only 10:
	{Anopheles_aquasalis}
You may want to ignore posterior probabilities and other statistics related to the following branch branch because the effective number of genes impacting it is only 10:
	{Anopheles_oryzalimnetes}
You may want to ig
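The warnings above flag branches that are informed by few gene trees. If the log text is captured to a string, the affected branches can be collected with a regular expression. This is a sketch that relies only on the message wording visible in the output above; the function name `low_support_branches` is hypothetical.

```python
import re

# matches "... is only <N>:" followed by a "{taxon, taxon}" line
WARNING_RE = re.compile(
    r'effective number of genes impacting it is only (\d+):\s*\{([^}]*)\}')

def low_support_branches(log_text):
    """Return (effective_gene_count, [taxa]) for each warning in the log."""
    found = []
    for match in WARNING_RE.finditer(log_text):
        taxa = [t.strip() for t in match.group(2).split(',')]
        found.append((int(match.group(1)), taxa))
    return found

prefix = ('You may want to ignore posterior probabilities and other '
          'statistics related to the following branch branch because the '
          'effective number of genes impacting it is only ')
log = (prefix + '14:\n\t{Anopheles_bellator}\n' +
       prefix + '13:\n\t{Anopheles_bellator, Anopheles_cruzii}\n' +
       'You may want to ig')  # truncated entries are skipped
branches = low_support_branches(log)
```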

## Sample tree

In [10]:
with open(SAMPLE_MAPPING, 'w') as o:
    for _, r in comb_seq.groupby('s_Sample')['hap_Sample'] \
                      .unique() \
                      .reset_index() \
                      .iterrows():
        
        # access by column name; positional r[0]/r[1] is deprecated in recent pandas
        o.write('{}:{}\n'.format(r['s_Sample'], ','.join(r['hap_Sample'])))
! head -1 {SAMPLE_MAPPING}

Abel-SP24:Abel-SP24-1,Abel-SP24-2


In [11]:
! java -jar {astral} -i {GENE_TREES} -o {SAMPLE_TREE} -a {SAMPLE_MAPPING}



This is ASTRAL version 5.6.3
Gene trees are treated as unrooted
62 trees read from data/5_gene_trees.nwk
All output trees will be *arbitrarily* rooted at VBS00112

Number of taxa: 306 (153 species)
Taxa: {Aimp-M0001=[Aimp-M0001-1, Aimp-M0001-2], Avin-B0009=[Avin-B0009-1, Avin-B0009-2], Aten-191=[Aten-191-2, Aten-191-1], anopheles-farauti-far1scaffoldsafarf2=[anopheles-farauti-far1scaffoldsafarf2-1, anopheles-farauti-far1scaffoldsafarf2-2], Athe-6-2=[Athe-6-2-2, Athe-6-2-1], Abro-22=[Abro-22-2, Abro-22-1], Acol-570=[Acol-570-1, Acol-570-2], Athe-6-1=[Athe-6-1-2, Athe-6-1-1], Abro-21=[Abro-21-2, Abro-21-1], Anils-7=[Anils-7-1, Anils-7-2], Avin-M0012=[Avin-M0012-2, Avin-M0012-1], anopheles-gambiae-pimperenascaffoldsagams1=[anopheles-gambiae-pimperenascaffoldsagams1-1, anopheles-gambiae-pimperenascaffoldsagams1-2], anopheles-funestus-fumozchromosomesafunf3=[anopheles-funestus-fumozchromosomesafunf3-1, anopheles-funestus-fumozchromosomesafunf3-2], Avin-B0004=[Avin-B0004-1, Avin-B0004-2], 

Species tree distances calculated ...
Will attempt to complete bipartitions from X before adding using a distance matrix.
Building set of clusters (X) from gene trees 
In second round sampling 2 rounds will be done
------------------------------
gradient0: 6151
------------------------------
gradient1: 846
Number of Clusters after addition by distance: 6997
calculating extra bipartitions to be added at level 1 ...
Adding to X using resolutions of greedy consensus ...
Limit for sigma of degrees:3875
polytomy size limit : 16
discarded polytomies:  [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 7, 9, 9, 11, 15, 15, 16]
Threshold 0.0:
polytomy of size 3; rounds with additions with at least 5 support: 1; clusters: 7003
Threshold 0.01:
polytomy of size 3; rounds with additions with at least 5 support: 0; clusters: 7003
Threshold 0.02:
polytomy of size 4; rounds with additions with at least 5 suppor

polytomy of size 5; rounds with additions with at least 5 support: 0; clusters: 7109
Threshold 0.3333333333333333:
polytomy of size 4; rounds with additions with at least 5 support: 0; clusters: 7109
polytomy of size 3; rounds with additions with at least 5 support: 0; clusters: 7109
polytomy of size 4; rounds with additions with at least 5 support: 0; clusters: 7109
polytomy of size 5; rounds with additions with at least 5 support: 0; clusters: 7109
polytomy of size 3; rounds with additions with at least 5 support: 0; clusters: 7109
polytomy of size 3; rounds with additions with at least 5 support: 0; clusters: 7109
polytomy of size 3; rounds with additions with at least 5 support: 0; clusters: 7109
polytomy of size 5; rounds with additions with at least 5 support: 0; clusters: 7109
polytomy of size 4; rounds with additions with at least 5 support: 0; clusters: 7109
polytomy of size 5; rounds with additions with at least 5 support: 0; clusters: 7109
polytomy of size 4; rounds with add

You may want to ignore posterior probabilities and other statistics related to the following branch branch because the effective number of genes impacting it is only 13:
	{Apal-257}
You may want to ignore posterior probabilities and other statistics related to the following branch branch because the effective number of genes impacting it is only 13:
	{Azie-77, Acou-80, Aten-185, Acou-962, Acou-959, Apal-81, Aten-79, Acou-71, Azie-334, Aten-954, Aten-191, Acou-956, Azie-1032, Azie-1055, Azie-70, Aten-333}
You may want to ignore posterior probabilities and other statistics related to the following branch branch because the effective number of genes impacting it is only 13:
	{Azie-77, Acou-80, Aten-185, Acou-962, Acou-959, Apal-81, Aten-79, Acou-71, Azie-334, Aten-954, Apal-257, Aten-191, Acou-956, Azie-1032, Azie-1055, Azie-70, Aten-333}
((VBS00112-1,VBS00112-2)0.38:0.00565182328245857,((VBS00113-2,VBS00113-1)0.94:0.2592622814761897,((VBS00114-2,VBS00114-1)0.73:0.13040254862920553,((VBS0