# Goal
Jacobo de la Cuesta-Zuluaga.

Now that I have taxonomically classified the **vadinCA11** bin as a member of the order *Methanomassiliicoccales*, and established that it can be considered a near-complete genome with low contamination (98.4% completeness and 0% contamination), I can download the genomes of closely related organisms for the comparative genomic analyses. In this notebook I fetch the assemblies from NCBI and assess their quality.

# Init

In [14]:
import os
import pandas as pd
import subprocess
import urllib.request
import re

# Var

In [15]:
# Work dir
work_dir = "/ebio/abt3_projects/vadinCA11/data/V11"

# Included genomes dir
included_genomes_dir = os.path.join(work_dir, "included_genomes")

# vadinCA11 genome bin
V11_bin = "/ebio/abt3_projects/vadinCA11/data/metagenome/LLMGA/HiSeqRun83-91/bin_refine/DAS_Tool//bins_DASTool_bins/metabat2_low_PE.478.contigs.fa"
V11_folder = "/ebio/abt3_projects/vadinCA11/data/metagenome/LLMGA/HiSeqRun83-91/bin_refine/DAS_Tool//bins_DASTool_bins/"

# Misc
quality_env = "py2_genome_quality"
metacompass_env = "metacompass"

# Get genomes
I wasn't able to get `ncbi-genome-download` to fetch the genomes of all available *Methanomassiliicoccales*. Therefore, I will adapt Guillermo's script, which I have already used and know works.

In [16]:
# Get the list of assemblies into the file `assembly_summary.txt`
# If only bacterial genomes are needed, change the FTP address to
# ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
assembly_summary = os.path.join(work_dir, "genomes", "assembly_summary.txt")
ncbi_ftp = "ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt"
urllib.request.urlretrieve(ncbi_ftp, assembly_summary)

# Read the file skipping the first line
data_ncbi = pd.read_csv(assembly_summary, sep='\t', skiprows=1)

# Define a simple dataset of the complete genomes only, including the URLs where we can download them.
# complete_genes = data_ncbi[data_ncbi["assembly_level"] == "Complete Genome"][["# assembly_accession", 
#                                                                              "asm_name", "organism_name", 
#                                                                              "ftp_path"]]


# Define a simple dataset of all genomes and assemblies, including the URLs where we can download them.
complete_genes = data_ncbi[["# assembly_accession", "asm_name", "organism_name", "ftp_path"]]

complete_genes.head()

Unnamed: 0,# assembly_accession,asm_name,organism_name,ftp_path
0,GCA_000001215.4,Release 6 plus ISO1 MT,Drosophila melanogaster,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000...
1,GCA_000001405.27,GRCh38.p12,Homo sapiens,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000...
2,GCA_000001515.5,Pan_tro 3.0,Pan troglodytes,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000...
3,GCA_000001545.3,P_pygmaeus_2.0.2,Pongo abelii,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000...
4,GCA_000001635.8,GRCm38.p6,Mus musculus,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000...


In [17]:
def get_members(accession_list, directory, output_file):
    """Download the genomic FASTA of each accession into `directory`
    and write a table of the downloaded assemblies to `output_file`."""
    # Note: this changes the working directory, and later cells rely on it
    os.chdir(directory)
    assembly_lst = []
    accession_lst = []
    organism_lst = []
    
    for accession in accession_list:
        _df = complete_genes[complete_genes["# assembly_accession"] == accession]
        
        # Skip accessions that are missing from (or duplicated in) the summary
        if _df.shape[0] != 1:
            continue
        
        url_dir = _df["ftp_path"].iloc[0]
        asm_name = _df["asm_name"].iloc[0]
        filename = "{}_{}_genomic.fna.gz".format(accession, asm_name)
        
        if not os.path.isfile(filename):
            # The remote file is named after the last component of the FTP directory
            url_lnk = "{}/{}_genomic.fna.gz".format(url_dir, url_dir.split("/")[-1])
            urllib.request.urlretrieve(url_lnk, filename)
        
        assembly_lst.append(asm_name)
        accession_lst.append(accession)
        organism_lst.append(_df["organism_name"].iloc[0])
        
    _df = pd.DataFrame({"Assembly": assembly_lst, 
                        "Organism": organism_lst, 
                        "Accession": accession_lst})
    _df.to_csv(output_file)
    
    return _df
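
`get_members` silently skips any accession that is not found exactly once in the assembly summary, so a typo or a withdrawn assembly goes unnoticed. The following is a minimal sketch of a check that could be run beforehand. The helper name `split_found_missing` and the accession `GCA_000000000.0` are hypothetical and only serve as illustration; duplicated accessions are not handled here.

```python
def split_found_missing(requested, available):
    """Split the requested accessions into those present in `available`
    and those that are absent, preserving the requested order."""
    available = set(available)
    found = [acc for acc in requested if acc in available]
    missing = [acc for acc in requested if acc not in available]
    return found, missing


# Two accessions taken from the assembly summary preview shown earlier
available_accessions = ["GCA_000001215.4", "GCA_000001405.27"]
# The second requested accession is a made-up placeholder
requested_accessions = ["GCA_000001215.4", "GCA_000000000.0"]

found, missing = split_found_missing(requested_accessions, available_accessions)
print("found:", found)
print("missing:", missing)
```

In the notebook, `available` could be the column `complete_genes["# assembly_accession"]`, since any iterable of accession strings works.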

In [99]:
# Looked for all assemblies available for Methanomassiliicoccales[ORGANISM] in NCBI assembly
# on 07.06.2018
download_folder = os.path.join(work_dir, "genomes")
downloaded_list = os.path.join(download_folder, "Methanomassiliicoccales.csv")
Methanomassiliicoccales_members = ["GCA_000308215.1", "GCA_000300255.2", "GCA_000404225.1", 
                                   "GCA_000800805.1", "GCA_001481295.1", "GCA_001421175.1", 
                                   "GCA_001421185.1", "GCA_900313075.1", "GCA_900314325.1", 
                                   "GCA_002494585.1", "GCA_002494705.1", "GCA_002494805.1", 
                                   "GCA_002494945.1", "GCA_002495125.1", "GCA_002495325.1", 
                                   "GCA_002495495.1", "GCA_002495585.1", "GCA_002495665.1", 
                                   "GCA_002496345.1", "GCA_002496385.1", "GCA_002496785.1", 
                                   "GCA_002496945.1", "GCA_002497075.1", "GCA_002497155.1", 
                                   "GCA_002497195.1", "GCA_002497475.1", "GCA_002497815.1", 
                                   "GCA_002497995.1", "GCA_002498285.1", "GCA_002498365.1", 
                                   "GCA_002498425.1", "GCA_002498545.1", "GCA_002498605.1", 
                                   "GCA_002498765.1", "GCA_002498785.1", "GCA_002498805.1", 
                                   "GCA_002499085.1", "GCA_002502005.1", "GCA_002502165.1", 
                                   "GCA_002502465.1", "GCA_002502545.1", "GCA_002502765.1", 
                                   "GCA_002502925.1", "GCA_002502965.1", "GCA_002503495.1", 
                                   "GCA_002503545.1", "GCA_002503645.1", "GCA_002503785.1", 
                                   "GCA_002503925.1", "GCA_002504405.1", "GCA_002504495.1", 
                                   "GCA_002504525.1", "GCA_002504645.1", "GCA_002505225.1", 
                                   "GCA_002505245.1", "GCA_002505275.1", "GCA_002505345.1", 
                                   "GCA_002505735.1", "GCA_002506175.1", "GCA_002506255.1", 
                                   "GCA_002506325.1", "GCA_002506425.1", "GCA_002506565.1", 
                                   "GCA_002506865.1", "GCA_002506905.1", "GCA_002506985.1", 
                                   "GCA_002506995.1", "GCA_002508545.1", "GCA_002508555.1", 
                                   "GCA_002508585.1", "GCA_002508595.1", "GCA_002508625.1", 
                                   "GCA_002509405.1", "GCA_002509415.1", "GCA_002509425.1", 
                                   "GCA_002509465.1", "GCA_003135935.1", "GCA_003153895.1"]

In [100]:
# Download genome assemblies in fasta format from NCBI
Methanomassiliicoccales = get_members(Methanomassiliicoccales_members, 
                                      download_folder, 
                                      downloaded_list)

In [101]:
# Uncompress genome files
gunzip_job = "gunzip {0}/*.fna.gz".format(download_folder)
print(gunzip_job)
!$gunzip_job

gunzip /ebio/abt3_projects/vadinCA11/data/V11/genomes/*.fna.gz


In [102]:
# Rename all files just to have the accession in the name
rename_job = "cd {0}; rename 's/_[a-zA-Z].*_genomic//g' GCA*".format(download_folder)
print(rename_job)
!$rename_job

cd /ebio/abt3_projects/vadinCA11/data/V11/genomes; rename 's/_[a-zA-Z].*_genomic//g' GCA*


In [268]:
# Fix two file names that the rename step did not reduce to the accession
!pwd
!mv GCA_003135935.1_20110800.fna GCA_003135935.1.fna
!mv GCA_003153895.1_20120700.fna GCA_003153895.1.fna

/ebio/abt3_projects/vadinCA11/data/V11/genomes
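
The pattern `_[a-zA-Z].*_genomic` only starts matching at an underscore followed by a letter, which is presumably why the two files above kept a numeric fragment and needed a manual fix. A possible alternative, sketched here under the assumption that every file name starts with a `GCA_`/`GCF_` accession including its version, is to keep the accession and drop everything else. The helper `accession_filename` is hypothetical and not part of the original workflow.

```python
import re

# Accession with version at the start of the name, e.g. GCA_000308215.1
ACCESSION_RE = re.compile(r"^(GC[AF]_\d+\.\d+)")


def accession_filename(filename, extension=".fna"):
    """Return '<accession><extension>' if the name starts with an accession,
    otherwise return the name unchanged."""
    match = ACCESSION_RE.match(filename)
    if match is None:
        return filename
    return match.group(1) + extension


for name in ["GCA_000308215.1_ASM30821v1_genomic.fna",
             "GCA_003135935.1_20110800.fna",
             "Methanomassiliicoccales.csv"]:
    print(name, "->", accession_filename(name))
```

Files that do not start with an accession, such as the CSV table, are left untouched.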


In [266]:
# Print table with accessions
Methanomassiliicoccales_df = pd.read_csv(downloaded_list, sep=',', header=0)
Methanomassiliicoccales_df = Methanomassiliicoccales_df.drop(["Unnamed: 0"], axis=1)
Methanomassiliicoccales_df

Unnamed: 0,Accession,Assembly,Organism
0,GCA_000308215.1,ASM30821v1,Methanomassiliicoccus luminyensis B10
1,GCA_000300255.2,ASM30025v2,Candidatus Methanomethylophilus alvus Mx1201
2,GCA_000404225.1,ASM40422v1,Candidatus Methanomassiliicoccus intestinalis ...
3,GCA_000800805.1,ASM80080v1,Candidatus Methanoplasma termitum
4,GCA_001481295.1,ASM148129v1,Candidatus Methanomethylophilus sp. 1R26
5,GCA_001421175.1,ASM142117v1,Methanomassiliicoccales archaeon RumEn M2
6,GCA_001421185.1,ASM142118v1,Methanomassiliicoccales archaeon RumEn M1
7,GCA_900313075.1,Rumen uncultured genome RUG779,uncultured Candidatus Methanomethylophilus sp.
8,GCA_900314325.1,Rumen uncultured genome hRUG898,uncultured Candidatus Methanomethylophilus sp.
9,GCA_002494585.1,ASM249458v1,Methanomassiliicoccaceae archaeon UBA409


In [260]:
# Write a file with the names and accessions of the downloaded genomes
include_table = os.path.join(download_folder, "included_table.txt")
Methanomassiliicoccales_df.to_csv(include_table, sep='\t')

# Assess quality of downloaded genomes
To make sure the downstream analyses rest on good-quality data, I will first assess the quality of the downloaded genomes with CheckM. This way, I can identify and exclude incomplete or contaminated genomes.

## Run CheckM

In [7]:
# Run CheckM to verify the completeness and redundancy of the downloaded genomes

checkm_file = os.path.join(work_dir, "checkm_downloaded_genomes", 'CheckM_genomes.txt')
checkm_cmd = """checkm lineage_wf -t {0} \
    -x {1} \
    -f {2} \
    {3} {4}/checkm_downloaded_genomes """
checkm_cmd = checkm_cmd.format(8, "fna", checkm_file, download_folder, work_dir)
checkm_job = 'bash -c "source activate {0}; {1}"'
checkm_job = checkm_job.format(quality_env, checkm_cmd)
print(checkm_job)
!$checkm_job

bash -c "source activate py2_genome_quality; checkm lineage_wf -t 8     -x fna     -f /ebio/abt3_projects/vadinCA11/data/V11/checkm_downloaded_genomes/CheckM_genomes.txt     /ebio/abt3_projects/vadinCA11/data/V11/genomes /ebio/abt3_projects/vadinCA11/data/V11/checkm_downloaded_genomes "

*******************************************************************************
 [CheckM - tree] Placing bins in reference genome tree.
*******************************************************************************

  Identifying marker genes in 78 bins with 8 threads:
    Finished processing 78 of 78 (100.00%) bins.
  Saving HMM info to file.

  Calculating genome statistics for 78 bins with 8 threads:
    Finished processing 78 of 78 (100.00%) bins.

  Extracting marker genes to align.
  Parsing HMM hits to marker genes:
    Finished parsing hits for 32 of 78 (41.03%) bins.
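
As a side note, the same CheckM call can be assembled as an argument list instead of a formatted string, which avoids quoting problems if a path ever contains spaces. This is only a sketch: the function name `checkm_lineage_cmd` and the `/path/to/...` locations are placeholders, and only the options already used in the command above (`-t`, `-x`, `-f`) appear.

```python
import shlex


def checkm_lineage_cmd(bins_dir, out_dir, out_file, threads=8, extension="fna"):
    """Return the argument list for `checkm lineage_wf` with the options
    used in this notebook."""
    return ["checkm", "lineage_wf",
            "-t", str(threads),
            "-x", extension,
            "-f", out_file,
            bins_dir, out_dir]


# Placeholder paths, for illustration only
cmd = checkm_lineage_cmd("/path/to/genomes",
                         "/path/to/checkm_out",
                         "/path/to/checkm_out/CheckM_genomes.txt")
printable = " ".join(shlex.quote(part) for part in cmd)
print(printable)
```

The list could then be handed to `subprocess.run(cmd, check=True)` from within the environment where CheckM is installed.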

## Check results

In [3]:
import ast

checkm_stats = "{0}/checkm_output/storage/bin_stats_ext.tsv".format(work_dir)
checkm_raw_df = pd.read_csv(checkm_stats, sep='\t', skiprows=0, header=None)
# Select only the stats from the V11 bin and turn the dict literal into a dict
checkm_df = ast.literal_eval(checkm_raw_df.iat[0,1])
checkm_df = pd.DataFrame(list(checkm_df.items()))
checkm_df.columns = ["Stat", "Value"]
checkm_df = checkm_df.set_index('Stat')
checkm_df = checkm_df.loc[["Completeness", "Contamination", "Genome size", "# ambiguous bases", 
               "# contigs", "Mean contig length", "N50 (contigs)", "Longest contig",
               "# scaffolds", "Mean scaffold length", "N50 (scaffolds)",
               "Longest scaffold", "GC", "GC std", "# predicted genes", 
               "Translation table", "Coding density", "marker lineage", "# genomes",
               "# markers", "# marker sets" ]]
checkm_df.rename(columns = {"Value": "V11_RL001"}, inplace=True)
checkm_df

Stat,V11_RL001
Completeness,98.3871
Contamination,0
Genome size,1590932
# ambiguous bases,0
# contigs,48
Mean contig length,33144.4
N50 (contigs),50438
Longest contig,103078
# scaffolds,48
Mean scaffold length,33144.4


In [4]:
# Summarize the CheckM results on downloaded genomes
import ast

checkm_stats = "{0}/checkm_downloaded_genomes/storage/bin_stats_ext.tsv".format(work_dir)
# Use a separate name so that the V11 table stored in `checkm_df` is not overwritten
checkm_genomes_df = pd.read_csv(checkm_stats, sep='\t', skiprows=0, header=None)

# Names of assemblies
assemblies = list(checkm_genomes_df.iloc[:,0])

# Create a single df with all results
summary_df = pd.DataFrame()
for i in checkm_genomes_df.iloc[:, 1]:
    tmp_df = ast.literal_eval(i)  # each row stores its stats as a dict literal
    tmp_df = pd.DataFrame(list(tmp_df.items()))
    tmp_df.columns = ["Stat", "Value"]
    tmp_df = tmp_df.set_index('Stat')
    tmp_df = tmp_df.loc[["Completeness", "Contamination", "Genome size", "# ambiguous bases", 
               "# contigs", "Mean contig length", "N50 (contigs)", "Longest contig",
               "# scaffolds", "Mean scaffold length", "N50 (scaffolds)",
               "Longest scaffold", "GC", "GC std", "# predicted genes", 
               "Translation table", "Coding density", "marker lineage", "# genomes",
               "# markers", "# marker sets" ]]
    summary_df = pd.concat([summary_df, tmp_df], axis=1)

# Rename columns
summary_df.columns = assemblies
summary_df

Stat,GCA_000300255.2,GCA_002503545.1,GCA_002495495.1,GCA_002506865.1,GCA_002499085.1,GCA_002494805.1,GCA_002502545.1,GCA_000404225.1,GCA_002495125.1,GCA_002498765.1,...,GCA_002503785.1,GCA_900313075.1,GCA_002505245.1,GCA_002508585.1,GCA_002508545.1,GCA_002509415.1,GCA_002494945.1,GCA_002506255.1,GCA_002498425.1,GCA_002496945.1
Completeness,98.3871,88.3788,86.2068,83.3633,76.8,92.3387,88.4904,98.7903,65.0538,95.0775,...,96.9086,82.2581,95.5645,88.3002,97.3118,88.936,64.0521,61.1559,90.7258,97.135
Contamination,0.806452,0.94086,0.403226,4.03226,2.4,0.806452,0,0.806452,0,0,...,0,2.41935,0,0.806452,0.806452,0,0,0.806452,0,0
Genome size,1666795,1249378,1063745,1104259,1749312,1648656,1115929,1931651,882347,1174624,...,1308604,1262884,1175704,1883692,2360299,1123872,869682,823014,1224361,1401758
# ambiguous bases,0,1306,6646,20303,479,790,17388,0,21968,10307,...,383,0,5963,74196,2392,23022,22328,26401,5065,6936
# contigs,1,56,107,175,46,47,135,1,158,66,...,22,192,49,464,111,138,216,223,38,60
Mean contig length,1.6668e+06,22287,9879.43,6194.03,38018.1,35061,8137.34,1.93165e+06,5445.44,17641.2,...,59464.6,6577.52,23872.3,3899.78,21242.4,7977.17,3922.94,3572.26,32086.7,23247
N50 (contigs),1666795,45710,16791,13507,56845,64245,25662,1931651,37789,135661,...,102142,8388,46981,7813,46502,21846,7750,6129,55747,49812
Longest contig,1666795,99237,55665,44660,154585,269292,103911,1931651,86336,270640,...,181776,35746,122573,40576,119060,49127,31952,23521,153909,136345
# scaffolds,1,34,80,58,37,36,38,1,20,29,...,12,192,16,144,68,42,95,93,15,24
Mean scaffold length,1.6668e+06,36746.4,13296.8,19038.9,47278.7,45796,29366.6,1.93165e+06,44117.3,40504.3,...,109050,6577.52,73481.5,13081.2,34710.3,26758.9,9154.55,8849.61,81624.1,58406.6
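
Each row of `bin_stats_ext.tsv` holds the bin name, a tab, and a Python dict literal with the statistics. A minimal, dependency-free sketch of how such rows can be parsed without `eval` is shown here. The two rows are a shortened excerpt built from values in the table above (only three of the statistics are kept), and the helper name `parse_bin_stats` is hypothetical.

```python
import ast

# Shortened excerpt in the layout of bin_stats_ext.tsv (three stats per bin)
toy_lines = [
    "GCA_000300255.2\t{'Completeness': 98.3871, 'Contamination': 0.806452, 'Genome size': 1666795}",
    "GCA_002495125.1\t{'Completeness': 65.0538, 'Contamination': 0.0, 'Genome size': 882347}",
]


def parse_bin_stats(lines):
    """Return {bin_name: stats_dict}, parsing each dict literal safely."""
    stats = {}
    for line in lines:
        name, literal = line.rstrip("\n").split("\t", 1)
        # literal_eval only accepts Python literals, so no code is executed
        stats[name] = ast.literal_eval(literal)
    return stats


toy_stats = parse_bin_stats(toy_lines)
for name, values in toy_stats.items():
    print(name, values["Completeness"], values["Contamination"])
```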


In [5]:
# Select only substantially complete genomes (>= 70%) with low contamination (<= 5%)
complete = summary_df.loc['Completeness'] >= 70
non_contaminated = summary_df.loc['Contamination'] <= 5
# Summary of included
include = complete & non_contaminated
include_df = summary_df.loc[: , include]
# Summary of excluded
exclude = ~(summary_df.columns.isin(include_df.columns))
exclude_df = summary_df.loc[: , exclude]
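
To make the inclusion rule explicit, here is a small sketch of the same criterion (completeness >= 70% and contamination <= 5%) applied to three genomes whose values are copied from the summary table. The function `passes_quality` is a hypothetical helper and not part of the original analysis.

```python
def passes_quality(completeness, contamination,
                   min_completeness=70.0, max_contamination=5.0):
    """True if a genome meets both thresholds used in this notebook."""
    return completeness >= min_completeness and contamination <= max_contamination


# (completeness, contamination) pairs copied from the CheckM summary table
checkm_values = {
    "GCA_000300255.2": (98.3871, 0.806452),
    "GCA_002506865.1": (83.3633, 4.03226),
    "GCA_002495125.1": (65.0538, 0.0),
}

included = [name for name, (comp, cont) in checkm_values.items()
            if passes_quality(comp, cont)]
excluded = [name for name in checkm_values if name not in included]
print("included:", included)
print("excluded:", excluded)
```

A genome with no detectable contamination is still excluded when it falls below the completeness threshold, as happens with `GCA_002495125.1`.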

In [51]:
# Table with genome statistics of included
include_df = pd.concat([checkm_df, include_df], axis=1)
included_stats = os.path.join(download_folder, "included_stats.txt")
include_df.to_csv(included_stats, sep='\t')

In [48]:
include_df

Stat,V11_RL001,GCA_000300255.2,GCA_002503545.1,GCA_002495495.1,GCA_002506865.1,GCA_002499085.1,GCA_002494805.1,GCA_002502545.1,GCA_000404225.1,GCA_002498765.1,...,GCA_002498285.1,GCA_002509465.1,GCA_002503785.1,GCA_900313075.1,GCA_002505245.1,GCA_002508585.1,GCA_002508545.1,GCA_002509415.1,GCA_002498425.1,GCA_002496945.1
Completeness,98.3871,98.3871,88.3788,86.2068,83.3633,76.8,92.3387,88.4904,98.7903,95.0775,...,84.7177,84.7455,96.9086,82.2581,95.5645,88.3002,97.3118,88.936,90.7258,97.135
Contamination,0,0.806452,0.94086,0.403226,4.03226,2.4,0.806452,0.0,0.806452,0.0,...,0.0,0.0,0.0,2.41935,0.0,0.806452,0.806452,0.0,0.0,0.0
Genome size,1590932,1666795,1249378,1063745,1104259,1749312,1648656.0,1115929.0,1931651.0,1174624.0,...,1827384.0,969311.0,1308604.0,1262884.0,1175704.0,1883692.0,2360299.0,1123872.0,1224361.0,1401758.0
# ambiguous bases,0,0,1306,6646,20303,479,790.0,17388.0,0.0,10307.0,...,98446.0,5047.0,383.0,0.0,5963.0,74196.0,2392.0,23022.0,5065.0,6936.0
# contigs,48,1,56,107,175,46,47.0,135.0,1.0,66.0,...,572.0,104.0,22.0,192.0,49.0,464.0,111.0,138.0,38.0,60.0
Mean contig length,33144.4,1.6668e+06,22287,9879.43,6194.03,38018.1,35061.0,8137.34,1931650.0,17641.2,...,3022.62,9271.77,59464.6,6577.52,23872.3,3899.78,21242.4,7977.17,32086.7,23247.0
N50 (contigs),50438,1666795,45710,16791,13507,56845,64245.0,25662.0,1931651.0,135661.0,...,7077.0,15368.0,102142.0,8388.0,46981.0,7813.0,46502.0,21846.0,55747.0,49812.0
Longest contig,103078,1666795,99237,55665,44660,154585,269292.0,103911.0,1931651.0,270640.0,...,71752.0,40716.0,181776.0,35746.0,122573.0,40576.0,119060.0,49127.0,153909.0,136345.0
# scaffolds,48,1,34,80,58,37,36.0,38.0,1.0,29.0,...,110.0,69.0,12.0,192.0,16.0,144.0,68.0,42.0,15.0,24.0
Mean scaffold length,33144.4,1.6668e+06,36746.4,13296.8,19038.9,47278.7,45796.0,29366.6,1931650.0,40504.3,...,16612.6,14048.0,109050.0,6577.52,73481.5,13081.2,34710.3,26758.9,81624.1,58406.6


In [246]:
exclude_df.columns.tolist()

['GCA_002495125.1',
 'GCA_002497815.1',
 'GCA_002505735.1',
 'GCA_002502165.1',
 'GCA_002497195.1',
 'GCA_002494945.1',
 'GCA_002506255.1']

In [257]:
# Write file with the paths of the V11 bin and the included genomes
include_file = os.path.join(download_folder, "included_genomes.txt")
with open(include_file, "w") as myfile:  # "w" avoids duplicated lines on re-runs
    myfile.write(V11_bin + "\n") # V11
    for inclusion in include_df.columns.tolist():
        if inclusion == "V11_RL001":
            continue  # the path of the V11 bin was already written above
        filename = os.path.join(download_folder, inclusion + ".fna")
        myfile.write(filename + "\n") # Each included genome

# *Thermoplasmata* genomes
For the phylogenomic analyses, I need an outgroup. For this I will use several genomes from organisms of the class Thermoplasmata. I will download their genome sequences and proteomes.

In [18]:
# Create folder and save accessions
Thermoplasmata_folder = os.path.join(work_dir, "genomes", "Thermoplasmata")
if not os.path.exists(Thermoplasmata_folder):
    os.makedirs(Thermoplasmata_folder)
Thermoplasmata_list = os.path.join(Thermoplasmata_folder, "Thermoplasmata.csv")

In [25]:
# I will include RefSeq representative genomes
Thermoplasmata_members = ["GCA_000195915.1", "GCA_900176435.1", 
                          "GCA_000152265.2", "GCA_001402945.1", 
                          "GCA_900090055.1"]

Thermoplasmata_newname = ["GCA_000195915", "GCA_900176435", 
                          "GCA_000152265", "GCA_001402945", 
                          "GCA_900090055"]

In [26]:
# Download genome assemblies in fasta format from NCBI
Thermoplasmata = get_members(Thermoplasmata_members, 
                                Thermoplasmata_folder, 
                                Thermoplasmata_list)

In [27]:
# Uncompress genome files
gunzip_job = "gunzip {0}/*.fna.gz".format(Thermoplasmata_folder)
print(gunzip_job)
!$gunzip_job

gunzip /ebio/abt3_projects/vadinCA11/data/V11/genomes/Thermoplasmata/*.fna.gz


In [28]:
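# Fix file names (full paths, so this does not depend on the working directory)
for f in os.listdir(Thermoplasmata_folder):
    f_new = re.sub('_[a-zA-Z].*_genomic', '', f)
    os.rename(os.path.join(Thermoplasmata_folder, f),
              os.path.join(Thermoplasmata_folder, f_new))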

In [29]:
# Print table with accessions
Thermoplasmata_df = pd.read_csv(Thermoplasmata_list, sep=',', header=0)
Thermoplasmata_df = Thermoplasmata_df.drop(["Unnamed: 0"], axis=1)
Thermoplasmata_df

Unnamed: 0,Assembly,Organism,Accession
0,ASM19591v1,Thermoplasma acidophilum DSM 1728,GCA_000195915.1
1,IMG-taxon 2579779151 annotated assembly,Picrophilus oshimae DSM 9789,GCA_900176435.1
2,ASM15226v2,Ferroplasma acidarmanus fer1,GCA_000152265.2
3,ASM140294v1,Acidiplasma aeolicum,GCA_001402945.1
4,C.divulgatum PM4,Cuniculiplasma divulgatum,GCA_900090055.1


In [36]:
# Download Thermoplasmata proteomes
Thermoplasma = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/915/GCF_000195915.1_ASM19591v1/GCF_000195915.1_ASM19591v1_protein.faa.gz"
Picrophilus = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/176/435/GCF_900176435.1_IMG-taxon_2579779151_annotated_assembly/GCF_900176435.1_IMG-taxon_2579779151_annotated_assembly_protein.faa.gz"
Ferroplasma = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/152/265/GCF_000152265.2_ASM15226v2/GCF_000152265.2_ASM15226v2_protein.faa.gz"
Acidiplasma = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/402/945/GCF_001402945.1_ASM140294v1/GCF_001402945.1_ASM140294v1_protein.faa.gz"
Cuniculiplasma = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/090/055/GCF_900090055.1_C.divulgatum_PM4/GCF_900090055.1_C.divulgatum_PM4_protein.faa.gz"

Thermoplasmata_prots = [Thermoplasma, Picrophilus, Ferroplasma, Acidiplasma, Cuniculiplasma]

# Download command: {0} = URL, {1} = output folder, {2} = accession without version
download_cmd = "wget {0} --output-document {1}/{2}.faa.gz; gunzip {1}/{2}.faa.gz"
for prot_url, newname in zip(Thermoplasmata_prots, Thermoplasmata_newname):
    download_job = download_cmd.format(prot_url, Thermoplasmata_folder, newname)
    print(download_job)
    !$download_job

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/915/GCF_000195915.1_ASM19591v1/GCF_000195915.1_ASM19591v1_protein.faa.gz --output-document /ebio/abt3_projects/vadinCA11/data/V11/genomes/Thermoplasmata/GCA_000195915.faa.gz; gunzip /ebio/abt3_projects/vadinCA11/data/V11/genomes/Thermoplasmata/GCA_000195915.faa.gz
--2019-01-25 10:11:20--  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/915/GCF_000195915.1_ASM19591v1/GCF_000195915.1_ASM19591v1_protein.faa.gz
           => ‘/ebio/abt3_projects/vadinCA11/data/V11/genomes/Thermoplasmata/GCA_000195915.faa.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.7, 2607:f220:41e:250::11
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /genomes/all/GCF/000/195/915/GCF_000195915.1_ASM19591v1 ... done.
==> SIZE GCF_000195915.1_ASM19591v1_protein.faa.gz ... 298504
==> PASV
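
The proteome URLs above were written out by hand, but they follow a regular layout: the accession digits are split into groups of three, and the directory is named after the accession plus the assembly name with spaces replaced by underscores. The following sketch builds such a URL. The helper `ncbi_file_url` is hypothetical, and the space-to-underscore rule is an assumption that matches the five URLs listed here, not a general guarantee.

```python
def ncbi_file_url(accession, asm_name, suffix="protein.faa.gz",
                  base="ftp://ftp.ncbi.nlm.nih.gov/genomes/all"):
    """Build the URL of a file in the NCBI genomes FTP layout."""
    prefix, rest = accession.split("_", 1)   # e.g. "GCF" and "000195915.1"
    digits = rest.split(".")[0]              # e.g. "000195915"
    triplets = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    # Assumption: spaces in the assembly name become underscores
    dirname = "{}_{}".format(accession, asm_name.replace(" ", "_"))
    filename = "{}_{}".format(dirname, suffix)
    return "/".join([base, prefix] + triplets + [dirname, filename])


thermoplasma_url = ncbi_file_url("GCF_000195915.1", "ASM19591v1")
picrophilus_url = ncbi_file_url("GCF_900176435.1",
                                "IMG-taxon 2579779151 annotated assembly")
print(thermoplasma_url)
print(picrophilus_url)
```

Passing `suffix="genomic.fna.gz"` would give the matching genome sequence file for the same assembly.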

# Session info

In [37]:
!conda list -n metacompass

# packages in environment at /ebio/abt3_projects/software/miniconda3_gt4.4/envs/metacompass:
#
# Name                    Version                   Build  Channel
aioeasywebdav             2.2.0                    py36_0    conda-forge
aiohttp                   3.4.4            py36h470a237_0    conda-forge
anvio                     5.1.0            py36hcb787e7_1    bioconda
appdirs                   1.4.3                      py_1    conda-forge
aragorn                   1.2.38               h470a237_2    bioconda
asn1crypto                0.24.0                     py_1    conda-forge
async-timeout             3.0.1                   py_1000    conda-forge
attrs                     18.1.0                     py_1    conda-forge
automat                   0.7.0                    py36_0    conda-forge
backcall                  0.1.0                      py_0    conda-forge
barrnap                   0.9                           2    bioconda
bcftools                  1.9 