# Goal
Jacobo de la Cuesta-Zuluaga.

Previously, I downloaded all the available *Methanomassiliicoccales* assemblies from NCBI and ran `CheckM` to determine which were at least *substantially complete genomes* (>= 70%) *with low contamination* (<= 5%). Next, I will determine how similar these assemblies are to the **vadinCA11** bin. In this notebook I will run overall similarity tests, using `sourmash` and comparing the average nucleotide identity (ANI) or the average amino acid identity (AAI). Functional analyses will be performed in a separate notebook.
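
The inclusion criteria above amount to a simple two-threshold filter. The following is a minimal sketch of that filter, not the code used in the earlier notebook; the assembly names and values are made up for illustration.

```python
# Hypothetical CheckM results (made-up values), one entry per assembly
checkm = {
    "assembly_A": {"completeness": 98.4, "contamination": 0.0},
    "assembly_B": {"completeness": 65.0, "contamination": 1.2},
    "assembly_C": {"completeness": 88.3, "contamination": 7.5},
}

MIN_COMPLETENESS = 70.0   # substantially complete
MAX_CONTAMINATION = 5.0   # low contamination

# Keep only assemblies that satisfy both criteria
included = sorted(
    name for name, stats in checkm.items()
    if stats["completeness"] >= MIN_COMPLETENESS
    and stats["contamination"] <= MAX_CONTAMINATION
)
print(included)
```

Here only `assembly_A` passes: `assembly_B` is too incomplete and `assembly_C` is too contaminated.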

# Var

In [3]:
# Work dir
work_dir = "/ebio/abt3_projects/vadinCA11/data/V11"
genomes_dir = os.path.join(work_dir, "genomes")

# vadinCA11 genome bin
V11_bin = "/ebio/abt3_projects/vadinCA11/data/metagenome/LLMGA/HiSeqRun83-91/bin_refine/DAS_Tool//bins_DASTool_bins/metabat2_low_PE.478.contigs.fa"
V11_folder = "/ebio/abt3_projects/vadinCA11/data/metagenome/LLMGA/HiSeqRun83-91/bin_refine/DAS_Tool//bins_DASTool_bins/"

# List of included genomes and V11
included_genomes_file =  os.path.join(work_dir, "genomes", "included_genomes.txt")
included_genomes_dir = os.path.join(work_dir, "included_genomes")

# Misc
quality_env = "py2_genome_quality"
metacompass_env = "metacompass"

# Init

In [4]:
import os
import pandas as pd
import subprocess

In [113]:
# I will create a folder with all included genomes to facilitate further work
with open(included_genomes_file, "r") as genomes:
    inclusions = [line.rstrip('\n') for line in genomes]

# Space-separated list of genome paths, used as shell arguments below
genomes_input = " ".join(inclusions)
    
!cp $genomes_input $included_genomes_dir

/ebio/abt3_projects/vadinCA11/data/metagenome/LLMGA/HiSeqRun83-91/bin_refine/DAS_Tool//bins_DASTool_bins/metabat2_low_PE.478.contigs.fa /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_000300255.2.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002503545.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002495495.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002506865.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002499085.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002494805.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002502545.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_003135935.1_20110800.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_000404225.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002498765.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_000800805.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002498545.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_

In [186]:
# Rename the V11 file to something clearer
# It will be named V11_RL001.fna (from vadinCA11 Ruth Ley 001)
!cd $included_genomes_dir; mv metabat2_low_PE.478.contigs.fa V11_RL001.fna
!cd $included_genomes_dir; ls -lh V11_RL001.fna

-rw-r--r-- 1 jdelacuesta abt3 1.6M Jun 18 17:55 V11_RL001.fna


# Compare genomes with Sourmash

In [42]:
# Select kmer size to run sourmash

In [114]:
# Compute signatures for all genomes with sourmash
signature_file = os.path.join(work_dir, "sourmash_output")  # directory where the .sig files are written
compute_cmd = "cd {0}; sourmash compute --scaled 1000 -k 21,31,51 {1}"
compute_cmd = compute_cmd.format(signature_file, genomes_input)
compute_job = 'bash -c "source activate {0}; {1}"'
compute_job = compute_job.format(metacompass_env, compute_cmd)
print(compute_job)
!$compute_job

bash -c "source activate metacompass; cd /ebio/abt3_projects/vadinCA11/data/V11/sourmash_output; sourmash compute --scaled 1000 -k 21,31,51 /ebio/abt3_projects/vadinCA11/data/metagenome/LLMGA/HiSeqRun83-91/bin_refine/DAS_Tool//bins_DASTool_bins/metabat2_low_PE.478.contigs.fa /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_000300255.2.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002503545.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002495495.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002506865.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002499085.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002494805.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002502545.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_003135935.1_20110800.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_000404225.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genomes/GCA_002498765.1.fna /ebio/abt3_projects/vadinCA11/data/V11/genome

In [151]:
# Compare all signatures with each other
# See http://sourmash.readthedocs.io/en/latest/using-sourmash-a-guide.html#what-k-mer-size-s-should-i-use
kmer_size = 21
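compare_file = os.path.join(work_dir, "sourmash_output", "genome_compare.csv")
# Note: the CSV is written to v11.cmp.csv inside the sourmash output directory;
# compare_file is passed to format() but has no placeholder in the command
compare_cmd = "cd {0}; sourmash compare *.sig --csv v11.cmp.csv -k {2}"
compare_cmd = compare_cmd.format(signature_file, compare_file, kmer_size)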
compare_job = 'bash -c "source activate {0}; {1}"'
compare_job = compare_job.format(metacompass_env, compare_cmd)
print(compare_job)
!$compare_job

bash -c "source activate metacompass; cd /ebio/abt3_projects/vadinCA11/data/V11/sourmash_output; sourmash compare *.sig --csv v11.cmp.csv -k 21"
loaded 72 signatures total.
downsampling to scaled value of 1000
min similarity in matrix: 0.000


I will plot the `sourmash` output using R in a [separate notebook](/notebooks/notebooks/metagenome/assembly/HiSeqRuns83-91/5.1_Plots_genome_comparison.ipynb). If the plot generated by `sourmash` itself is desired instead of the R plot, use the commands below, keeping in mind that `sourmash plot` reads the binary matrix written by `sourmash compare -o` (not the CSV file), so `compare_file` must point to such a matrix.

```python
# Plot dendrogram
plot_cmd = "cd {0}; sourmash plot {1}"
plot_cmd = plot_cmd.format(signature_file, compare_file)
plot_job = 'bash -c "source activate {0}; {1}"'
plot_job = plot_job.format(metacompass_env, plot_cmd)
print(plot_job)
!$plot_job
```
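
The comparison matrix can also be inspected directly in Python to list the genomes closest to the vadinCA11 bin. The sketch below assumes the layout that, as far as I know, `sourmash compare --csv` writes: a header row with the signature names followed by a square matrix without row labels. The helper `closest_to` and the toy values are hypothetical.

```python
import csv
import io

def closest_to(csv_text, query, top_n=3):
    """Return the top_n (name, similarity) pairs most similar to `query`."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    names = rows[0]                                   # header: signature names
    matrix = [[float(x) for x in row] for row in rows[1:]]
    q = names.index(query)
    # Similarity of the query to every other genome, highest first
    hits = [(name, matrix[q][j]) for j, name in enumerate(names) if j != q]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:top_n]

# Toy matrix with made-up similarities
toy_csv = """V11_RL001.fna,GCA_A.fna,GCA_B.fna
1.0,0.12,0.0
0.12,1.0,0.03
0.0,0.03,1.0
"""
print(closest_to(toy_csv, "V11_RL001.fna", top_n=2))
```

For the real data, the text of `v11.cmp.csv` would be passed instead of `toy_csv`.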

# Calculate the average nucleotide identity (ANI)
The `sourmash` results make it possible to rapidly identify the genomes closest to the genome of interest. However, this approach does not work well across large evolutionary distances (anything beyond genus level). This can be seen in the resulting heatmap, where the Jaccard index between most genomes was close to zero. Since I am comparing genomes at the order level, I will calculate the average nucleotide identity between all downloaded genomes using the `pyani` software.
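
To illustrate why the k-mer Jaccard index collapses so quickly with divergence, here is a minimal sketch using exact k-mer sets (not the MinHash sketches that `sourmash` actually computes). The toy sequences and the helper names are made up. A single substitution changes up to k overlapping k-mers, and under the simplifying assumption of independent substitutions only about 11% of 21-mers are expected to be mismatch-free at 90% identity.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(seq_a, seq_b, k):
    """Jaccard index: shared k-mers divided by all distinct k-mers."""
    kmers_a, kmers_b = kmers(seq_a, k), kmers(seq_b, k)
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0

# Toy sequences differing by a single substitution (T -> A at position 5)
seq_a = "ACGTTGCA"
seq_b = "ACGTAGCA"
toy_jaccard = jaccard(seq_a, seq_b, k=3)
print(toy_jaccard)  # 3 shared k-mers out of 9 distinct ones

# Expected fraction of 21-mers without any mismatch,
# assuming independent substitutions (a simplification)
for identity in (0.95, 0.90, 0.80):
    print(identity, round(identity ** 21, 3))
```

Even though the toy sequences are 87.5% identical, their Jaccard index with k = 3 is only one third.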

In [142]:
# Create bash script to submit to cluster
script_pyani = """#!/bin/bash
#$ -N {0}
#$ -pe parallel 8
#$ -l h_vmem=32G
#$ -l h_rt=72:0:0
#$ -o $HOME/tmp/SGE/job_stdout
#$ -j y
#$ -wd {1}
#$ -m ea
#$ -M jdelacuesta@tuebingen.mpg.de

export PATH='{1}':$PATH

./average_nucleotide_identity.py -i {2} \
    -o {3} \
    -m ANIb \
    --workers 8 \
    --force \
    --seed 2112
    
"""
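
One detail of the template above: because `script_pyani` is a regular (non-raw) triple-quoted string, Python consumes each backslash-newline pair, so the command ends up in the job script as one long line rather than with shell line continuations. That is harmless here. A minimal illustration with a made-up command:

```python
# A backslash at the end of a line inside a non-raw string literal
# joins that line with the next one
template = """cmd -i {0} \
    -o {1}
"""
rendered = template.format("in_dir", "out_dir")
print(repr(rendered))
```

Using a raw string (`r"""..."""`) would keep the backslashes in the written script instead.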

In [143]:
# Prepare the parameters to run in the cluster
job_name = "Methanomasilii_ANI"
metacompass_bin = "/ebio/abt3_projects/software/miniconda3/envs/metacompass/bin"
pyani_dir = os.path.join(work_dir, "pyani_output")
pyani_log = os.path.join(pyani_dir, "pyani_log.txt")

script_fn = os.path.join(work_dir, "cluster_jobs", job_name + ".sh")
# Write the job script; the context manager closes the file even on error
with open(script_fn, "w") as fh:
    fh.write(script_pyani.format(job_name, metacompass_bin, included_genomes_dir, pyani_dir))

In [144]:
# Submit
subprocess.run(["qsub", script_fn])

CompletedProcess(args=['qsub', '/ebio/abt3_projects/vadinCA11/data/V11/cluster_jobs/Methanomasilii_ANI.sh'], returncode=0)

# dRep method
dRep combines the two approaches I already used: it first estimates the Mash distance (a fast approximation of ANI) between all pairs of genomes, and then calculates ANIm only between genomes that are similar enough according to Mash, i.e. whose Mash-estimated ANI is above a given threshold. This way it can compare a large number of genomes efficiently.
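
The idea of the two-step approach can be sketched as follows. This is a simplification under stated assumptions, not dRep's implementation: dRep clusters the genomes on the Mash distances and compares genomes within each primary cluster, whereas the hypothetical helper below simply selects the pairs above the primary threshold. All values are made up.

```python
from itertools import combinations

def pairs_for_secondary(mash_ani, p_ani=0.80):
    """Return genome pairs whose Mash-estimated ANI is at least p_ani."""
    genomes = sorted({g for pair in mash_ani for g in pair})
    selected = []
    for a, b in combinations(genomes, 2):
        # Pairs missing from the table are treated as unrelated
        ani = mash_ani.get((a, b), mash_ani.get((b, a), 0.0))
        if ani >= p_ani:
            selected.append((a, b))
    return selected

# Toy Mash-estimated ANI values
toy_mash = {
    ("g1", "g2"): 0.97,
    ("g1", "g3"): 0.72,
    ("g2", "g3"): 0.70,
    ("g3", "g4"): 0.85,
}
print(pairs_for_secondary(toy_mash))
```

Only two of the six possible pairs would go on to the expensive alignment-based comparison, which is where the time saving comes from.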

## dRep compare

In [28]:
drep_output = os.path.join(work_dir, "drep")
if not os.path.exists(drep_output):
    os.makedirs(drep_output)

In [191]:
# Running dRep
# I'm using a genus definition of Mash ANI >= 0.80 (--P_ani) and a species definition of ANIm >= 0.95 (--S_ani)
drep_cmd = "dRep compare --P_ani 0.80 --S_ani 0.95 {0} -g {1}"
drep_cmd = drep_cmd.format(drep_output, included_genomes_dir+"/*")
drep_job = 'bash -c "source activate {0}; {1}"'
drep_job = drep_job.format(metacompass_env, drep_cmd)
print(drep_job)
!$drep_job

bash -c "source activate metacompass; dRep compare --P_ani 0.80 --S_ani 0.95 /ebio/abt3_projects/vadinCA11/data/V11/drep -g /ebio/abt3_projects/vadinCA11/data/V11/included_genomes/*"
***************************************************
    ..:: dRep compare Step 1. Cluster ::..
***************************************************
    
Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
19 primary clusters made
Step 3. Perform secondary clustering
Running 704 ANImf comparisons- should take ~ 58.7 min
Step 4. Return output
***************************************************
    ..:: dRep compare Step 2. Bonus ::..
***************************************************
    
Loading work directory
***************************************************
    ..:: dRep compare Step 3. Evaluate ::..
***************************************************
    
*****************************************

## dRep dereplicate

In addition, I will dereplicate the genomes with `dRep`. Because of issues with python versions, `CheckM` and `dRep`, I will use the already obtained values of completeness and contamination of the included genomes.

### Create CheckM stats table

In [24]:
# Read included genomes checkM data
tab_cM_raw = pd.read_csv("/ebio/abt3_projects/vadinCA11/data/V11/genomes/included_stats.txt", delimiter="\t")

# Select the completeness and contamination rows and transpose
tab_cM = tab_cM_raw.loc[0:1,:].transpose()
tab_cM_raw.head()

Unnamed: 0,Stat,V11_RL001,GCA_000300255.2,GCA_002503545.1,GCA_002495495.1,GCA_002506865.1,GCA_002499085.1,GCA_002494805.1,GCA_002502545.1,GCA_000404225.1,...,GCA_002498285.1,GCA_002509465.1,GCA_002503785.1,GCA_900313075.1,GCA_002505245.1,GCA_002508585.1,GCA_002508545.1,GCA_002509415.1,GCA_002498425.1,GCA_002496945.1
0,Completeness,98.38709677419357,98.38709677419357,88.37882547559967,86.20684776457271,83.36326237429802,76.8,92.33870967741936,88.4903791737408,98.79032258064515,...,84.71768751904575,84.74554873536196,96.90860215053765,82.25806451612904,95.56451612903226,88.30024813895781,97.31182795698926,88.93604980192417,90.7258064516129,97.13497453310696
1,Contamination,0.0,0.8064516129032258,0.9408602150537636,0.4032258064516129,4.032258064516129,2.4,0.8064516129032258,0.0,0.8064516129032258,...,0.0,0.0,0.0,2.419354838709677,0.0,0.8064516129032258,0.8064516129032258,0.0,0.0,0.0
2,Genome size,1590932.0,1666795.0,1249378.0,1063745.0,1104259.0,1749312.0,1648656.0,1115929.0,1931651.0,...,1827384.0,969311.0,1308604.0,1262884.0,1175704.0,1883692.0,2360299.0,1123872.0,1224361.0,1401758.0
3,# ambiguous bases,0.0,0.0,1306.0,6646.0,20303.0,479.0,790.0,17388.0,0.0,...,98446.0,5047.0,383.0,0.0,5963.0,74196.0,2392.0,23022.0,5065.0,6936.0
4,# contigs,48.0,1.0,56.0,107.0,175.0,46.0,47.0,135.0,1.0,...,572.0,104.0,22.0,192.0,49.0,464.0,111.0,138.0,38.0,60.0


In [25]:
# Select genomes, completeness and contamination as lists
genome_raw = list(tab_cM.index)[1:]
# Add the .fna extension so names match the files in included_genomes_dir
genome = [name + ".fna" for name in genome_raw]

completeness = list(tab_cM.iloc[1:, 0])
contamination = list(tab_cM.iloc[1:, 1])
# Create dictionary and then data frame
genome_dict = {"genome": genome, "completeness": completeness, "contamination": contamination}
dRep_cM = pd.DataFrame(genome_dict)

In [27]:
# Export table
dRep_cM_file = "/ebio/abt3_projects/vadinCA11/data/V11/genomes/CheckM_dRep.txt"
dRep_cM.to_csv(dRep_cM_file, sep=",", index=False)
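
For reference, the `--genomeInfo` table that `dRep` reads is a comma-separated file with the columns `genome` (the FASTA file name), `completeness` and `contamination`. The sketch below builds such a table with the standard library only; the two entries use rounded values from the CheckM table above, and the `stats` dictionary is a hypothetical stand-in for the real data frame.

```python
import csv
import io

# Rounded CheckM values for two of the included genomes
stats = {
    "V11_RL001": {"Completeness": 98.39, "Contamination": 0.0},
    "GCA_000300255.2": {"Completeness": 98.39, "Contamination": 0.81},
}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["genome", "completeness", "contamination"])
for name, values in stats.items():
    # The genome column must match the FASTA file names given to dRep
    writer.writerow([name + ".fna", values["Completeness"], values["Contamination"]])

print(buffer.getvalue())
```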

### Dereplicate

In [49]:
dereplication_output = os.path.join(drep_output, "dereplication")
if not os.path.exists(dereplication_output):
    os.makedirs(dereplication_output)

In [50]:
drep_cmd = "dRep dereplicate --P_ani 0.80 --S_ani 0.95 --genomeInfo {0} {1} -g {2}"
drep_cmd = drep_cmd.format(dRep_cM_file, dereplication_output, included_genomes_dir+"/*")
drep_job = 'bash -c "source activate {0}; {1}"'
drep_job = drep_job.format(metacompass_env, drep_cmd)
print(drep_job)
#!$drep_job

bash -c "source activate metacompass; dRep dereplicate --P_ani 0.80 --S_ani 0.95 --genomeInfo /ebio/abt3_projects/vadinCA11/data/V11/genomes/CheckM_dRep.txt /ebio/abt3_projects/vadinCA11/data/V11/drep/dereplication -g /ebio/abt3_projects/vadinCA11/data/V11/included_genomes/*"


In [47]:
!pwd

/ebio/abt3_projects/small_projects/jdelacuesta/vadinCA11/notebooks/metagenome/assembly/HiSeqRuns83-91


# Phylogeny using universal markers

In addition to the ANI dendrogram I obtained using `dRep`, I will generate a phylogeny using a set of universal marker genes. For this, I will use the `anvio` package. Instructions can be found [here](http://merenlab.org/2017/06/07/phylogenomics/).

In [5]:
anvio_output = os.path.join(work_dir, "anvio_output")
if not os.path.exists(anvio_output):
    os.makedirs(anvio_output)

In [6]:
for genome in os.listdir(included_genomes_dir):
    # create a tmp file with .fa extension
    basename = genome.split(os.extsep)[0]
    basename = os.path.join(anvio_output, basename+".fa")
    in_genome = os.path.join(included_genomes_dir, genome)
    !cat $in_genome > $basename

```python
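# Create anvio db
# Keep the templates separate from the formatted commands so that
# they are not overwritten after the first iteration
anvio_db_template = "anvi-script-FASTA-to-contigs-db {0}"
anvio_job_template = 'bash -c "source activate {0}; {1}"'
for genome in os.listdir(anvio_output):
    # Prepare anvio job, using the full path to the FASTA file
    genome_path = os.path.join(anvio_output, genome)
    anvio_db_cmd = anvio_db_template.format(genome_path)
    anvio_db_job = anvio_job_template.format(metacompass_env, anvio_db_cmd)
    print(anvio_db_job)

    # Execute job and remove temp file
    #!$anvio_db_job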
```

In [15]:
%%bash
# Create anvio dbs
source activate metacompass
for i in /ebio/abt3_projects/vadinCA11/data/V11/anvio_output/*fa
do
    anvi-script-FASTA-to-contigs-db $i
done

:: INPUT DIR: /ebio/abt3_projects/vadinCA11/data/V11/anvio_output, FNAME: GCA_000300255 ...
:: RENAMING CONTIGS ...
:: GENERATING THE CONTIGS DB ...
:: RUNNING HMMs ...

:: INPUT DIR: /ebio/abt3_projects/vadinCA11/data/V11/anvio_output, FNAME: GCA_000308215 ...
:: RENAMING CONTIGS ...
:: GENERATING THE CONTIGS DB ...
:: RUNNING HMMs ...

:: INPUT DIR: /ebio/abt3_projects/vadinCA11/data/V11/anvio_output, FNAME: GCA_000404225 ...
:: RENAMING CONTIGS ...
:: GENERATING THE CONTIGS DB ...
:: RUNNING HMMs ...

:: INPUT DIR: /ebio/abt3_projects/vadinCA11/data/V11/anvio_output, FNAME: GCA_000800805 ...
:: RENAMING CONTIGS ...
:: GENERATING THE CONTIGS DB ...
:: RUNNING HMMs ...

:: INP

In [17]:
# Remove all temporary .fa files
tmp_fa_files = os.path.join(anvio_output, "*.fa")
print(tmp_fa_files)
!rm $tmp_fa_files

/ebio/abt3_projects/vadinCA11/data/V11/anvio_output/*.fa


In [7]:
# List available single copy core genes
# Run this command on the terminal because the character limit
# won't show all the output
external_genomes = os.path.join(anvio_output, "external_genomes.txt")
hmm_cmd = "anvi-get-sequences-for-hmm-hits --external-genomes {0} --list-hmm-sources"
hmm_cmd = hmm_cmd.format(external_genomes)
hmm_job = 'bash -c "source activate {0}; {1}"'
hmm_job = hmm_job.format(metacompass_env, hmm_cmd)
print(hmm_job)
#!$hmm_job

bash -c "source activate metacompass; anvi-get-sequences-for-hmm-hits --external-genomes /ebio/abt3_projects/vadinCA11/data/V11/anvio_output/external_genomes.txt --list-hmm-sources"
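
The `external_genomes.txt` file used above is not created in this notebook. anvi'o expects it to be a tab-delimited table with the columns `name` and `contigs_db_path`. A minimal sketch of how such a file could be generated follows; the helper `write_external_genomes` is hypothetical, and it assumes that all contigs databases sit in one directory as `.db` files whose stems are valid anvi'o names.

```python
import os
import tempfile

def write_external_genomes(db_dir, out_path):
    """Write a tab-delimited table listing every .db file in db_dir."""
    db_files = sorted(f for f in os.listdir(db_dir) if f.endswith(".db"))
    with open(out_path, "w") as handle:
        handle.write("name\tcontigs_db_path\n")
        for db_file in db_files:
            name = os.path.splitext(db_file)[0]
            handle.write("{0}\t{1}\n".format(name, os.path.join(db_dir, db_file)))
    return len(db_files)

# Demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    for stem in ("GCA_000300255", "V11_RL001"):
        open(os.path.join(tmp_dir, stem + ".db"), "w").close()
    table_path = os.path.join(tmp_dir, "external_genomes.txt")
    n_written = write_external_genomes(tmp_dir, table_path)
    with open(table_path) as handle:
        table_lines = handle.read().splitlines()

print(n_written, table_lines[0])
```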


The available marker sets are

*HMM Sources common to all 71 genomes*
* [Rinke et al](https://www.nature.com/articles/nature12352) [type: singlecopy] [num genes: 162]
* [Campbell et al](http://www.pnas.org/content/110/14/5540.short) [type: singlecopy] [num genes: 139]
* Ribosomal RNAs [type: Ribosomal_RNAs] [num genes: 12]

The Rinke et al. dataset is an archaeal single-copy core gene collection, so I will use it in this analysis.


In [8]:
# List the genes included in the Rinke marker set
# Run this command on the terminal
hmm_cmd = "anvi-get-sequences-for-hmm-hits --external-genomes {0} \
    --hmm-source Rinke_et_al \
    --list-available-gene-names"
hmm_cmd = hmm_cmd.format(external_genomes)
hmm_job = 'bash -c "source activate {0}; {1}"'
hmm_job = hmm_job.format(metacompass_env, hmm_cmd)
print(hmm_job)

bash -c "source activate metacompass; anvi-get-sequences-for-hmm-hits --external-genomes /ebio/abt3_projects/vadinCA11/data/V11/anvio_output/external_genomes.txt     --hmm-source Ribosomal_RNAs     --list-available-gene-names"


**Rinke_et_al: single copy genes**: ATP-synt_C, ATP-synt_D, ATP-synt_F,
Adenylsucc_synt, AdoHcyase, AdoHcyase_NAD, AdoMet_Synthase, Archease, B5,
CTP-dep_RFKase, CTP_synth_N, DALR_1, DFP, DHO_dh, DKCLD, DNA_binding_1,
DNA_primase_S, DNA_primase_lrg, DUF137, DUF357, DUF359, DUF46, DUF655, DUF814,
DUF99, Diphthamide_syn, EF1_GNE, EFG_C, EFG_IV, EIF_2_alpha, Enolase_C,
Fibrillarin, GAD, GCD14, GMP_synt_C, HMG-CoA_red, Ham1p_like, IF-2, LigT_PEase,
MAF_flag10, MoaC, Mob_synth_C, NAC, NDK, NMD3, Nop, Nop10p, PGK, PTH2, PcrB,
Plug_translocon, Prefoldin, Prefoldin_2, PyrI, PyrI_C, RNA_pol_A_bac, RNA_pol_N,
RNA_pol_Rpb1_1, RNA_pol_Rpb1_2, RNA_pol_Rpb1_3, RNA_pol_Rpb1_4, RNA_pol_Rpb2_1,
RNA_pol_Rpb2_2, RNA_pol_Rpb2_3, RNA_pol_Rpb2_4, RNA_pol_Rpb2_5, RNA_pol_Rpb2_6,
RNA_pol_Rpb2_7, RNA_pol_Rpb4, RNA_pol_Rpb5_C, RNA_pol_Rpb6, RNase_HII, RS4NT,
Rib_5-P_isom_A, Ribosom_S12_S23, Ribosomal_L1, Ribosomal_L10, Ribosomal_L11,
Ribosomal_L11_N, Ribosomal_L13, Ribosomal_L14, Ribosomal_L15e, Ribosomal_L16,
Ribosomal_L18p, Ribosomal_L19e, Ribosomal_L2, Ribosomal_L21e, Ribosomal_L22,
Ribosomal_L23, Ribosomal_L24e, Ribosomal_L29, Ribosomal_L2_C, Ribosomal_L3,
Ribosomal_L30, Ribosomal_L31e, Ribosomal_L32e, Ribosomal_L37ae, Ribosomal_L37e,
Ribosomal_L39, Ribosomal_L4, Ribosomal_L44, Ribosomal_L5, Ribosomal_L5_C,
Ribosomal_L6, Ribosomal_S11, Ribosomal_S13, Ribosomal_S13_N, Ribosomal_S14,
Ribosomal_S15, Ribosomal_S17, Ribosomal_S17e, Ribosomal_S19, Ribosomal_S19e,
Ribosomal_S2, Ribosomal_S24e, Ribosomal_S27, Ribosomal_S27e, Ribosomal_S28e,
Ribosomal_S3Ae, Ribosomal_S3_C, Ribosomal_S4e, Ribosomal_S5, Ribosomal_S5_C,
Ribosomal_S6e, Ribosomal_S7, Ribosomal_S8, Ribosomal_S8e, Ribosomal_S9, RtcB,
SBDS, SBDS_C, SHS2_Rpb7-N, SNO, SRP19, SRP_SPB, SUI1, Sec61_beta, SecY, Spt4,
Spt5-NGN, TIM, TP6A_N, TRM, Topo-VIb_trans, TruB_N, UPF0004, UPF0086,
V_ATPase_I, Wyosine_form, XPG_I, XPG_N, YjeF_N, dsDNA_bind, eIF-5a, eIF-6,
eIF2_C, tRNA-synt_1c, tRNA-synt_1c_C, tRNA-synt_1d, tRNA_NucTransf2,
tRNA_deacylase, vATP-synt_E

In [11]:
# Obtain concatenated sequence of genes
# Keep genes that are in at least 50 genomes
concatenated_file = os.path.join(anvio_output, "concatenated_proteins.fa")
concat_cmd = "anvi-get-sequences-for-hmm-hits --external-genomes {0} \
    -o {1} \
    --hmm-source Rinke_et_al \
    --return-best-hit \
    --get-aa-sequences \
    --concatenate \
    --min-num-bins-gene-occurs 50"
concat_cmd = concat_cmd.format(external_genomes, concatenated_file)
concat_job = 'bash -c "source activate {0}; {1}"'
concat_job = concat_job.format(metacompass_env, concat_cmd)
print(concat_job)
#!$concat_job

bash -c "source activate metacompass; anvi-get-sequences-for-hmm-hits --external-genomes /ebio/abt3_projects/vadinCA11/data/V11/anvio_output/external_genomes.txt     -o /ebio/abt3_projects/vadinCA11/data/V11/anvio_output/concatenated_proteins.fa     --hmm-source Rinke_et_al     --return-best-hit     --get-aa-sequences     --concatenate     --min-num-bins-gene-occurs 50"
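
The occurrence filter applied by `--min-num-bins-gene-occurs` can be pictured with a toy example. This is only an illustration of the logic under my reading of the flag (keep a gene if it is found in at least N genomes), not anvi'o's code; the hit table is made up and the helper name is hypothetical.

```python
from collections import Counter

def genes_in_enough_genomes(hits, min_genomes):
    """Return the genes found in at least min_genomes genomes."""
    # Count each gene once per genome, even if it has several hits there
    counts = Counter(gene for genes in hits.values() for gene in set(genes))
    return sorted(gene for gene, n in counts.items() if n >= min_genomes)

# Toy marker-gene hits per genome
toy_hits = {
    "genome_1": ["SecY", "TIM", "PGK"],
    "genome_2": ["SecY", "TIM"],
    "genome_3": ["SecY"],
}
print(genes_in_enough_genomes(toy_hits, min_genomes=2))
```

With 71 genomes and a cutoff of 50, genes missing from many incomplete assemblies are dropped, which limits the amount of missing data in the concatenated alignment.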


In [12]:
# Construct tree using the concatenated alignment
phylogenome_file = os.path.join(anvio_output, "phylogenomic_tree.txt")
phylogenome_cmd = "anvi-gen-phylogenomic-tree -f {0} \
    -o {1}"
phylogenome_cmd = phylogenome_cmd.format(concatenated_file, phylogenome_file)
phylogenome_job = 'bash -c "source activate {0}; {1}"'
phylogenome_job = phylogenome_job.format(metacompass_env, phylogenome_cmd)
print(phylogenome_job)
#!$phylogenome_job

bash -c "source activate metacompass; anvi-gen-phylogenomic-tree -f /ebio/abt3_projects/vadinCA11/data/V11/anvio_output/concatenated_proteins.fa     -o /ebio/abt3_projects/vadinCA11/data/V11/anvio_output/phylogenomic_tree.txt"


```
Alignment sequence length ....................: 69,674
Version ......................................: FastTree Version 2.1.10 Double precision (No SSE3)
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Info .........................................: Ignored unknown character X (seen 50197 times)
Refining topology ............................: 25 rounds ME-NNIs, 2 rounds ME-SPRs, 12 rounds ML-NNIs
Info .........................................: Total branch-length 3.633 after 61.65 sec
ML-NNI round 1 ...............................: LogLk = -1234633.807 NNIs 16 max delta 105.55 Time 226.66
Info .........................................: Switched to using 20 rate categories (CAT approximation)
Info .........................................: Rate categories were divided by 0.826 so that average rate = 1.0
Info .........................................: CAT-based log-likelihoods may not be comparable across runs
Info .........................................: Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2 ...............................: LogLk = -1130849.645 NNIs 5 max delta 134.62 Time 357.87
ML-NNI round 3 ...............................: LogLk = -1130820.348 NNIs 2 max delta 6.39 Time 385.20
ML-NNI round 4 ...............................: LogLk = -1130785.069 NNIs 2 max delta 30.91 Time 405.16
ML-NNI round 5 ...............................: LogLk = -1130784.853 NNIs 0 max delta 0.00 Time 416.05
Info .........................................: Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 6 ...............................: LogLk = -1130692.015 NNIs 2 max delta 16.25 Time 536.72 (final)
Optimize all lengths .........................: LogLk = -1130690.891 Time 574.36
```

# Session info

In [196]:
!conda list -n metacompass

# packages in environment at /ebio/abt3_projects/software/miniconda3/envs/metacompass:
#
anvio                     4.0.0                    py35_2    bioconda
aragorn                   1.2.38                        1    bioconda
asn1crypto                0.24.0                   py35_0  
atomicwrites              1.1.5                    py35_0  
attrs                     18.1.0                   py35_0  
backcall                  0.1.0                    py35_0  
barrnap                   0.9                           0    bioconda
bcftools                  1.4.1                         0    bioconda
bcrypt                    3.1.4            py35ha35c455_0  
bedtools                  2.27.1                        1    bioconda
biopython                 1.68                     py35_0    bioconda
blast                     2.5.0                h3727419_3    bioconda
blast-legacy              2.2.26                        0    bioconda
bleach                    2.1.3                    