## **Project**: Natural products from the Palaeolithic

## **Section**: Download all Chlorobiales genomes and construct a phylogenetic tree


Anan Ibrahim, 01.05.2022

**Contents**
 - **Step1**: Create conda environment with required dependencies if not already installed
 - **Step2**: Download Chlorobiales genomes from the Assembly database
 - **Step3**: Construct entire Chlorobiales order + Ancient MAGs phylogenetic tree
 - **Step4**: Prune the tree to only include the clade with ancient MAGs
 - **Step5**: Calculate the ANI values

##########

**Step1**: Create conda environment with required dependencies if not already installed

##########

In [None]:
# All conda envs can be found in EMN001_Paleofuran/02-scripts/ENVS_*.yml
conda env create -f chlorobiales_genomes1.yml
conda env create -f chlorobiales_genomes2.yml
conda env create -f phylophlan.yml
conda env create -f fastani.yml
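
*Note:* The commands above create the environments unconditionally. A minimal sketch of the "if not already installed" check follows. `env_exists` and `create_if_missing` are hypothetical helper names, and the usage comment assumes that each environment name matches its YAML file name, which the source does not confirm.

```shell
#!/bin/bash
# Sketch only: skip `conda env create` when the environment is already listed.

# env_exists NAME : reads `conda env list` output on stdin (hypothetical helper)
env_exists() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# create_if_missing NAME YAML : hypothetical wrapper around `conda env create`
create_if_missing() {
    if conda env list | env_exists "$1"; then
        echo "env $1 already present, skipping"
    else
        conda env create -f "$2"
    fi
}

# Usage (assumption: env name == YAML file name without .yml):
# for env in chlorobiales_genomes1 chlorobiales_genomes2 phylophlan fastani; do
#     create_if_missing "$env" "$env.yml"
# done
```

The name check is kept separate from the conda call so that it can be tried on saved `conda env list` output without touching any environment.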

##########

**Step2**: Download Chlorobiales genomes from the Assembly database

##########

In [None]:
#!/bin/bash

############################
#Hashes and Directories
############################

# NOTE: Change directories in bash script accordingly 
# NOTE: Add the ancient Bins/MAGs in $BINS

# Directories: 
OUT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output
BINS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/BINS

CHLOROBIALES=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales

# Ancient samples:
EMN=$BINS/EMN001_021.fna
GOY6=$BINS/GOY006_RA.fna
GOY5=$BINS/GOY005_001.fna
PES=$BINS/PES001_018.fna
RIG=$BINS/RIG001_014.fna
PLV18=$BINS/PLV001_001.fna
PLV20=$BINS/PLV001_002.fna
TAF=$BINS/TAF017_RA.fna

mkdir -p $OUT

############################
#Download from Assembly (NCBI)
############################
mkdir -p $CHLOROBIALES
mkdir -p $CHLOROBIALES/fasta

eval "$(conda shell.bash hook)"
conda activate chlorobiales_genomes1

cd $CHLOROBIALES

# Query the NCBI Assembly database and save the matching assembly accessions
esearch -db assembly -query '"Chlorobiales"[Organism] AND (latest[filter] AND all[filter] NOT anomalous[filter])' \
| esummary | xtract -pattern DocumentSummary -element AssemblyAccession > chlorobiales_assembly-accs.txt
# Result: 599 accessions
conda deactivate

conda activate chlorobiales_genomes2

# Download the fasta files corresponding to the accessions
# -j sets the number of parallel downloads
cd $CHLOROBIALES/fasta

bit-dl-ncbi-assemblies -w $CHLOROBIALES/chlorobiales_assembly-accs.txt -f fasta -j 20
# Result: 596 genomes downloaded (the rest not found)
gzip -d *.gz 

conda deactivate 

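# Add the ancient MAGs and the outgroup to the reference genomes
cp $BINS/* $CHLOROBIALES/fasta
# ROOT is not set earlier in this script; it is defined here as in the later steps
ROOT=$BINS/JAGYWA010000017.fna
cp $ROOT $CHLOROBIALES/fasta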
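
*Note:* Step 2 reports 599 accessions but 596 downloaded genomes. A small sketch for listing the accessions without a downloaded file follows. `missing_accessions` is a hypothetical helper, and it assumes that every downloaded file name starts with its GCA_/GCF_ accession; the file names in the toy demonstration are made up.

```shell
#!/bin/bash
# Sketch only: list requested accessions that have no downloaded file.
# Assumption: downloaded file names start with the accession (GCA_... or GCF_...).

# missing_accessions LIST DIR : hypothetical helper, prints accessions absent from DIR
missing_accessions() {
    have=$(mktemp)
    ls "$2" | grep -oE '^GC[AF]_[0-9]+\.[0-9]+' | sort -u > "$have"
    sort -u "$1" | comm -23 - "$have"
    rm "$have"
}

# Toy demonstration in a temporary directory (made-up accessions)
demo=$(mktemp -d)
mkdir "$demo/fasta"
printf 'GCA_000000001.1\nGCF_000000002.1\nGCA_000000003.2\n' > "$demo/accs.txt"
touch "$demo/fasta/GCA_000000001.1.fa" "$demo/fasta/GCF_000000002.1_genomic.fna"
missing=$(missing_accessions "$demo/accs.txt" "$demo/fasta")
echo "$missing"
rm -r "$demo"
```

On the real data the call would be `missing_accessions $CHLOROBIALES/chlorobiales_assembly-accs.txt $CHLOROBIALES/fasta`, provided the naming assumption holds.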

##########

**Step3**: Construct entire Chlorobiales order + Ancient MAGs phylogenetic tree

##########

*Note:* Do not forget to add the outgroup to this genome set. We downloaded Flavobacterium branchiicola separately from NCBI: https://www.ncbi.nlm.nih.gov/nuccore/NZ_JAGYWA010000017

In [None]:
#!/bin/bash

############################
#Hashes and Directories
############################

# NOTE: Change directories in bash script accordingly 
# NOTE: Add the ancient Bins in $BINS

# Directories: 
OUT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output
BINS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/BINS

CHLOROBIALES=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales

# Samples:
EMN=$BINS/EMN001_021.fna
GOY6=$BINS/GOY006_RA.fna
GOY5=$BINS/GOY005_001.fna
PES=$BINS/PES001_018.fna
RIG=$BINS/RIG001_014.fna
PLV18=$BINS/PLV001_001.fna
PLV20=$BINS/PLV001_002.fna
TAF=$BINS/TAF017_RA.fna
ROOT=$BINS/JAGYWA010000017.fna

mkdir -p $OUT

############################
# PHYLOPHLAN-chlorobium limicola-all 596 genomes
############################
mkdir -p $OUT/TREE_chlorobiales_genomes

eval "$(conda shell.bash hook)"
conda activate phylophlan

cd $OUT/TREE_chlorobiales_genomes

# Make config file: alignment based on translated sequences, using mafft and diamond. If nucleotide alignments are needed, use --force_nucleotides here and in the phylophlan command.
phylophlan_write_config_file \
    -d a \
    -o $OUT/TREE_chlorobiales_genomes/config.cfg \
    --db_aa diamond \
    --map_dna diamond \
    --map_aa diamond \
    --msa mafft \
    --trim trimal \
    --tree1 fasttree \
    --tree2 raxml
    
# Set up database (can use either the general phylophlan database or specifically Chlorobium limicola).
# Here Chlorobium limicola was used because otherwise the query sequences would be removed for not fulfilling the HMM models created to retain the sequences.
mkdir -p $OUT/TREE_chlorobiales_genomes/core_genes/

phylophlan_setup_database \
    -g s__Chlorobium_limicola --database_update \
    -o $OUT/TREE_chlorobiales_genomes/core_genes/ \
    --verbose
    
## Use this flag when phylophlan run crashes  --clean_all 
phylophlan \
-i $CHLOROBIALES/fasta \
-d $OUT/TREE_chlorobiales_genomes/core_genes/s__Chlorobium_limicola \
-o $OUT/TREE_chlorobiales_genomes/ \
--diversity medium \
--accurate \
-f $OUT/TREE_chlorobiales_genomes/config.cfg \
--nproc 30 --min_num_markers 1 \
--verbose > $OUT/TREE_chlorobiales_genomes/phylophlan_output.log

# Bootstrap the tree
cd $OUT/TREE_chlorobiales_genomes/
raxmlHPC-PTHREADS-SSE3 -s fasta_concatenated.aln -n bootstrap_tree.tre -f a -m PROTGAMMAAUTO -N 100 -x 12345 -p 12345 -T 30

conda deactivate

##############################################
# ANNOTATE THE TREE
##############################################
eval "$(conda shell.bash hook)"
conda activate chlorobiales_genomes1

cd $OUT/TREE_chlorobiales_genomes

# Grab the assembly accession for annotation
ls $CHLOROBIALES/fasta > chlorobiales_assembly-accs.txt
# ls /Net/Groups/ccdata/databases/ncbi-ref-genomes/pseudomonas_syringae/fasta > ass_acc.txt
sed -i 's/ /_/g' chlorobiales_assembly-accs.txt
# Escape the dot and anchor to the end of the line so that only the .fna suffix is removed
sed -i 's/\.fna$//' chlorobiales_assembly-accs.txt

file=chlorobiales_assembly-accs.txt
lines=$(cat "$file")

for line in $lines
do
    esearch -db assembly -query "$line" | esummary | \
    xtract -pattern DocumentSummary \
        -sep "\t" -element AssemblyAccession,Taxid,AssemblyName,Organism,assembly-status >> ass_acc_2.txt
done

# Create an iTOL accession-to-lineage dataset:
# Print the accession and organism columns
awk -F '\t' '{print $1"\t"$4}' \
ass_acc_2.txt > ass_acc_itol2.txt

# Manually:
# Open the constructed tree in iTOL and drag this file onto it to label the branches.
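
*Note:* As far as we are aware, iTOL recognises a dragged label file by a short header (`LABELS`, a `SEPARATOR` line, then `DATA`). A sketch that wraps the two-column table in such a header follows. `make_itol_labels` is a hypothetical helper, the toy rows are illustrative, and the leaf names in the tree are assumed to equal the first column.

```shell
#!/bin/bash
# Sketch only: wrap an accession<TAB>organism table in an iTOL LABELS header.
# Assumption: tree leaf names are identical to the values in column 1.

# make_itol_labels TSV : hypothetical helper, writes the dataset to stdout
make_itol_labels() {
    printf 'LABELS\nSEPARATOR TAB\nDATA\n'
    cat "$1"
}

# Toy input (rows are illustrative, not taken from the real run)
tmp=$(mktemp)
printf 'GCA_000000001.1\tChlorobium limicola\nEMN001_021\tancient MAG\n' > "$tmp"
labels=$(make_itol_labels "$tmp")
echo "$labels"
rm "$tmp"
```

For the table built in Step 3 the call would be `make_itol_labels ass_acc_itol2.txt > ass_acc_itol2_labels.txt`, where the output file name is only a suggestion.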
 

##########

**Step4**: Prune the tree to only include the clade with ancient MAGs

##########

*Note:* Do not forget to add the outgroup to this genome set. We downloaded Flavobacterium branchiicola (NZ_JAGYWA010000017) separately from NCBI.

*Note:* Visualise the entire tree in the iTOL desktop app, then save the accession numbers of the pruned clade, which contains all genomes classified up to species level, to a text file.

In [None]:
#!/bin/bash

############################
#Hashes and Directories
############################

# NOTE: Change directories in bash script accordingly 
# NOTE: Add the ancient Bins in $BINS

# Directories: 
OUT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output
BINS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/BINS

CHLOROBIALES=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales

# Samples:
EMN=$BINS/EMN001_021.fna
GOY6=$BINS/GOY006_RA.fna
GOY5=$BINS/GOY005_001.fna
PES=$BINS/PES001_018.fna
RIG=$BINS/RIG001_014.fna
PLV18=$BINS/PLV001_001.fna
PLV20=$BINS/PLV001_002.fna
TAF=$BINS/TAF017_RA.fna
ROOT=$BINS/JAGYWA010000017.fna

mkdir -p $OUT

############################
# Ancient Clade: pruned assembly accs and tree
############################

mkdir -p $OUT/TREE_chlorobiales_genomes/pruned2

# MANUALLY: from iTOL, create a text file with the accessions of the genomes identified up to species level (PHYL_chlorobiales_genomes_annotated_to_spp.txt)
# Copy the pruned accession list to another folder, then replace spaces with underscores and append .fna to each line (run once only: every rerun appends .fna again)
sed -i 's/ /_/g' $OUT/TREE_chlorobiales_genomes/pruned2/PHYL_chlorobiales_genomes_annotated_to_spp.txt
sed -i "s|$|.fna|" $OUT/TREE_chlorobiales_genomes/pruned2/PHYL_chlorobiales_genomes_annotated_to_spp.txt

# copy the 31 genomes to another input directory
cd $CHLOROBIALES/fasta
for file in $(cat $OUT/TREE_chlorobiales_genomes/pruned2/PHYL_chlorobiales_genomes_annotated_to_spp.txt); do 
cp "$file" $OUT/TREE_chlorobiales_genomes/pruned2/; done

# Make a tree
eval "$(conda shell.bash hook)"
conda activate phylophlan

cd $OUT/TREE_chlorobiales_genomes/pruned2

# Make config file: alignment based on translated sequences, using mafft and diamond. If nucleotide alignments are needed, use --force_nucleotides here and in the phylophlan command.
phylophlan_write_config_file \
    -d a \
    -o $OUT/TREE_chlorobiales_genomes/pruned2/config.cfg \
    --db_aa diamond \
    --map_dna diamond \
    --map_aa diamond \
    --msa mafft \
    --trim trimal \
    --tree1 fasttree \
    --tree2 raxml
    
# Set up database (can use either the general phylophlan database or specifically Chlorobium limicola).
# Here Chlorobium limicola was used because otherwise the query sequences would be removed for not fulfilling the HMM models created to retain the sequences.
phylophlan_setup_database \
    -g s__Chlorobium_limicola --database_update \
    -o $OUT/TREE_chlorobiales_genomes/core_genes/ \
    --verbose
    
## Use this flag when phylophlan run crashes  --clean_all 
phylophlan \
-i $OUT/TREE_chlorobiales_genomes/pruned2 \
-d $OUT/TREE_chlorobiales_genomes/core_genes/s__Chlorobium_limicola \
-o $OUT/TREE_chlorobiales_genomes/pruned2/ \
--diversity low \
--accurate \
-f $OUT/TREE_chlorobiales_genomes/pruned2/config.cfg \
--nproc 30 --min_num_markers 1 \
--verbose > $OUT/TREE_chlorobiales_genomes/pruned2/phylophlan_output.log

############################
# Phylophlan alternative with bootstrapping 
############################
cd $OUT/TREE_chlorobiales_genomes/pruned2
raxmlHPC-PTHREADS-SSE3 -m PROTCATLG -f a -x 12345 -p 12345 -# 100 -T 28 -w $OUT/TREE_chlorobiales_genomes/pruned2 -s $OUT/TREE_chlorobiales_genomes/pruned2/*.aln -n pruned2_refined_bootstrap.tre
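
*Note:* The copy loop in Step 4 fails for any list entry without a matching file, which can easily happen with a manually exported list. A sketch of a check to run before copying follows. `check_listed_files` is a hypothetical helper and the toy file names are illustrative.

```shell
#!/bin/bash
# Sketch only: report list entries that have no matching file before copying.

# check_listed_files LIST DIR : hypothetical helper, returns 1 if any file is missing
check_listed_files() {
    clf_status=0
    while IFS= read -r clf_name; do
        [ -z "$clf_name" ] && continue          # skip empty lines
        if [ ! -f "$2/$clf_name" ]; then
            echo "missing: $clf_name" >&2
            clf_status=1
        fi
    done < "$1"
    return "$clf_status"
}

# Toy demonstration: one listed file exists, one does not
demo=$(mktemp -d)
touch "$demo/EMN001_021.fna"
printf 'EMN001_021.fna\nGOY005_001.fna\n' > "$demo/list.txt"
if check_listed_files "$demo/list.txt" "$demo" 2> "$demo/missing.log"; then
    result=complete
else
    result=incomplete
fi
missing_report=$(cat "$demo/missing.log")
echo "$result: $missing_report"
rm -r "$demo"
```

On the real data the list would be PHYL_chlorobiales_genomes_annotated_to_spp.txt and the directory `$CHLOROBIALES/fasta`.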


##########

**Step5**: Calculate the ANI values 

##########

*Note:* Do not forget to add the outgroup to this genome set. We downloaded Flavobacterium branchiicola (NZ_JAGYWA010000017) separately from NCBI.

*Note:* Visualise the entire tree in the iTOL desktop app, then save the accession numbers of the pruned clade, which contains all genomes classified up to species level, to a text file.

In [None]:
#!/bin/bash

############################
#Hashes and Directories
############################

# NOTE: Change directories in bash script accordingly 
# NOTE: Add the ancient Bins in $BINS

# Directories: 
OUT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output
BINS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/BINS

CHLOROBIALES=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales

# Samples:
EMN=$BINS/EMN001_021.fna
GOY6=$BINS/GOY006_RA.fna
GOY5=$BINS/GOY005_001.fna
PES=$BINS/PES001_018.fna
RIG=$BINS/RIG001_014.fna
PLV18=$BINS/PLV001_001.fna
PLV20=$BINS/PLV001_002.fna
TAF=$BINS/TAF017_RA.fna
ROOT=$BINS/JAGYWA010000017.fna

mkdir -p $OUT

############################
# Ancient Clade: add the ANI values
############################

eval "$(conda shell.bash hook)"
conda activate fastani

# Calculate ANI
cd $OUT/TREE_chlorobiales_genomes/pruned2
# conda create -n fastani -c conda-forge -c bioconda fastani

# add paths to every line
#cp $OUT/TREE_chlorobiales_genomes/pruned2/accessions_IDs_annotated_to_species_level.txt $OUT/TREE_chlorobiales_genomes/pruned2/acc-list-paths.txt 
#awk '$0="$OUT/TREE_chlorobiales_genomes/pruned2/"$0' $OUT/TREE_chlorobiales_genomes/pruned2/acc-list-paths.txt > acc-list-paths-written.txt

#fragLen 3000
fastANI --ql accessions_IDs_annotated_to_species_level.txt --rl accessions_IDs_annotated_to_species_level.txt -t 8 -k 16 --fragLen 3000 -o ANI_output_fg3000.txt --matrix
# For the heatmap ANI estimate figure without filtering
awk -F '\t' '{print $1"\t"$2"\t"$3}' ANI_output_fg3000.txt > ANI_output1_fg3000.txt

awk 'BEGIN {FS=OFS="\t"} 
           {col[$1]; row[$2]; val[$2,$1]=$3}
     END   {for(c in col) printf "%s", OFS c; print "";
            for(r in row)
              {printf "%s", r;
               for(c in col) printf "%s", OFS val[r,c]
               print ""}}' ANI_output1_fg3000.txt > ANI_OUT_MATRIX_fg3000.txt

#fragLen 1000
fastANI --ql accessions_IDs_annotated_to_species_level.txt --rl accessions_IDs_annotated_to_species_level.txt -t 8 -k 16 --fragLen 1000 -o ANI_output_fg1000.txt --matrix
# For the heatmap ANI estimate figure without filtering
awk -F '\t' '{print $1"\t"$2"\t"$3}' ANI_output_fg1000.txt > ANI_output1_fg1000.txt
awk 'BEGIN {FS=OFS="\t"} 
           {col[$1]; row[$2]; val[$2,$1]=$3}
     END   {for(c in col) printf "%s", OFS c; print "";
            for(r in row)
              {printf "%s", r;
               for(c in col) printf "%s", OFS val[r,c]
               print ""}}' ANI_output1_fg1000.txt > ANI_OUT_MATRIX_fg1000.txt

# For the heatmap ANI estimate figure with filtering: remove all mappings whose mapped fraction (column 4 / column 5, i.e. mapped fragments / total query fragments) is below 60%
# fragLen 3000
awk -F '\t' '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$4/$5}' ANI_output_fg3000.txt > ANI_output2_fg3000.txt
awk '$6 >= 0.60 { print }' ANI_output2_fg3000.txt > ANI_output2_fg3000_filter.txt
awk '{print $1"\t"$2"\t"$3}' ANI_output2_fg3000_filter.txt > ANI_output21_fg3000_filter.txt
awk 'BEGIN {FS=OFS="\t"} 
           {col[$1]; row[$2]; val[$2,$1]=$3}
     END   {for(c in col) printf "%s", OFS c; print "";
            for(r in row)
              {printf "%s", r;
               for(c in col) printf "%s", OFS val[r,c]
               print ""}}' ANI_output21_fg3000_filter.txt > ANI_OUT_MATRIX_fg3000_filter.txt
# fragLen 1000
awk -F '\t' '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$4/$5}' ANI_output_fg1000.txt > ANI_output2_fg1000.txt
awk '$6 >= 0.60 { print }' ANI_output2_fg1000.txt > ANI_output2_fg1000_filter.txt
awk '{print $1"\t"$2"\t"$3}' ANI_output2_fg1000_filter.txt > ANI_output21_fg1000_filter.txt
awk 'BEGIN {FS=OFS="\t"} 
           {col[$1]; row[$2]; val[$2,$1]=$3}
     END   {for(c in col) printf "%s", OFS c; print "";
            for(r in row)
              {printf "%s", r;
               for(c in col) printf "%s", OFS val[r,c]
               print ""}}' ANI_output21_fg1000_filter.txt > ANI_OUT_MATRIX_fg1000_filter.txt

############################
# Ancient Clade: add the source of bacteria
############################
# This was done manually using the NCBI and checking the publications individually
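
*Note:* In awk, `for (k in array)` visits keys in an unspecified order, so the matrices built in Step 5 may list rows and columns in different orders. The sketch below is an alternative, not the original method: it applies the mapped-fraction filter and writes rows and columns in the same sorted order, with `NA` for missing pairs. `ani_matrix` is a hypothetical helper and the toy values are made up.

```shell
#!/bin/bash
# Sketch only: filter fastANI rows by mapped fraction, then pivot to a matrix
# whose rows (references) and columns (queries) share one sorted order.

# ani_matrix FASTANI_OUT MIN_FRACTION : hypothetical helper, matrix on stdout
ani_matrix() {
    am_names=$(mktemp)
    cut -f1,2 "$1" | tr '\t' '\n' | sort -u > "$am_names"
    awk -v min="$2" 'BEGIN { FS = OFS = "\t" }
        NR == FNR { name[++n] = $1; next }              # first file: sorted names
        $5 > 0 && $4 / $5 >= min { val[$2, $1] = $3 }   # second file: keep passing rows
        END {
            for (c = 1; c <= n; c++) printf "%s", OFS name[c]
            print ""
            for (r = 1; r <= n; r++) {
                printf "%s", name[r]
                for (c = 1; c <= n; c++) {
                    key = name[r] SUBSEP name[c]
                    printf "%s", OFS ((key in val) ? val[key] : "NA")
                }
                print ""
            }
        }' "$am_names" "$1"
    rm "$am_names"
}

# Toy fastANI-style table: query, reference, ANI, mapped fragments, total fragments
tmp=$(mktemp)
printf 'a.fna\ta.fna\t100\t10\t10\n'  >  "$tmp"
printf 'a.fna\tb.fna\t95.5\t8\t10\n'  >> "$tmp"
printf 'b.fna\ta.fna\t95.3\t5\t10\n'  >> "$tmp"
printf 'b.fna\tb.fna\t100\t10\t10\n'  >> "$tmp"
matrix=$(ani_matrix "$tmp" 0.60)
echo "$matrix"
rm "$tmp"
```

In the toy table the `b.fna` query against `a.fna` maps only 5 of 10 fragments, so that cell becomes `NA`. A call such as `ani_matrix ANI_output_fg3000.txt 0.60` would correspond to the filtered fragLen 3000 case.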