## **Project**: Natural products from the Palaeolithic

## **Section**: Phylogenetic analysis of the dereplicated Chlorobiales genomes


Anan Ibrahim, 01.05.2022

**Contents**
 - **Step1**: Create conda environment with required dependencies if not already installed
 - **Step2**: Dereplicate the Chlorobiales genomes
 - **Step3**: Construct the phylogenetic tree: Chlorobium limicola (amino acid level)
 - **Step4**: Construct the phylogenetic tree: Chlorobaculum parvum (amino acid level)
 - **Step5**: Construct the phylogenetic tree: Chlorobium limicola (nucleotide level)

##########

**Step1**: Create conda environment with required dependencies if not already installed

##########

In [None]:
# All conda envs can be found in EMN001_Paleofuran/02-scripts/ENVS_*.yml
conda env create -f drep.yml
conda env create -f phylophlan.yml
conda env create -f phyloplan_nucleotide.yml
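
The step title says the environments should only be created when they are not already installed, whereas the commands above create them unconditionally. A minimal sketch of such a check follows. The helper names `env_in_listing` and `create_missing_envs` are hypothetical, and the sketch assumes that each environment is named after its `.yml` file (the `name:` field inside the files may differ).

```shell
# Hypothetical helper: reads a `conda env list`-style listing on stdin and
# succeeds if an environment with the given name is present.
env_in_listing() {
    awk -v name="$1" '!/^#/ && $1 == name { found = 1 } END { exit !found }'
}

# Hypothetical wrapper: create each environment only when it is missing.
# Assumes the environment name matches the .yml file name.
create_missing_envs() {
    local name
    for name in "$@"; do
        if conda env list | env_in_listing "$name"; then
            echo "Environment '$name' already exists, skipping"
        else
            conda env create -f "$name.yml"
        fi
    done
}

# Example call (commented out so the functions can be sourced without side effects):
# create_missing_envs drep phylophlan phyloplan_nucleotide
```

Keeping the listing parser separate from the `conda` call makes the check easy to try out on a saved listing before it is used on a cluster.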

##########

**Step2**: Dereplicate the Chlorobiales genomes 

##########

In [None]:
#!/bin/bash

############################
#Hashes and Directories
############################

# NOTE: Change directories in bash script accordingly 
# NOTE: Add the ancient Bins/MAGs in $BINS

# Directories: 
OUT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output
BINS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/BINS

CHLOROBIALES=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales
DREPGENOME=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales/drep/dereplicated_genomes

# Ancient samples:
EMN=$BINS/EMN001_021.fna
GOY6=$BINS/GOY006_RA.fna
GOY5=$BINS/GOY005_001.fna
PES=$BINS/PES001_018.fna
RIG=$BINS/RIG001_014.fna
PLV18=$BINS/PLV001_001.fna
PLV20=$BINS/PLV001_002.fna
TAF=$BINS/TAF017_RA.fna
ROOT=$BINS/JAGYWA010000017.fna

mkdir -p "$OUT"

############################
# dRep = dereplicate genomes
############################
mkdir -p "$CHLOROBIALES/drep"

eval "$(conda shell.bash hook)"
conda activate drep

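# Dereplicate all reference genomes without a CheckM quality filter.
# -pa: ANI threshold for the primary (Mash) clusters; -sa: ANI threshold for the secondary (ANImf) clusters.
dRep dereplicate "$CHLOROBIALES/drep" \
            -p 25 \
            --ignoreGenomeQuality \
            -g "$CHLOROBIALES"/fasta/*.fna \
            --S_algorithm ANImf \
            -pa 0.90 -sa 0.99

# Add the ancient bins and the root genome to the dereplicated set.
# $ROOT lives in $BINS, so the second copy only repeats what the first already did.
cp "$BINS"/* "$DREPGENOME"/
cp "$ROOT" "$DREPGENOME"/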

conda deactivate
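
The ancient-sample variables (`EMN`, `GOY6`, and so on) are defined but never used in the cell above. One possible use is a check that every expected bin exists and is non-empty before dRep runs and the bins are copied. The sketch below is an assumption about how such a check could look; `report_missing` is a hypothetical helper and not part of dRep.

```shell
# Hypothetical helper: report every argument that is not an existing,
# non-empty file. Returns 1 if at least one file is missing or empty.
report_missing() {
    local f status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example call with the variables defined in the cell above (commented out):
# report_missing "$EMN" "$GOY6" "$GOY5" "$PES" "$RIG" "$PLV18" "$PLV20" "$TAF" "$ROOT" || exit 1
```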

##########

**Step3**: Construct the phylogenetic tree: Chlorobium limicola (amino acid level)

##########

In [None]:
#!/bin/bash

############################
#Hashes and Directories
############################

# NOTE: Change directories in bash script accordingly 
# NOTE: Add the ancient Bins/MAGs in $BINS

# Directories: 
OUT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output
BINS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/BINS

CHLOROBIALES=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales
DREPGENOME=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales/drep/dereplicated_genomes

# Ancient samples:
EMN=$BINS/EMN001_021.fna
GOY6=$BINS/GOY006_RA.fna
GOY5=$BINS/GOY005_001.fna
PES=$BINS/PES001_018.fna
RIG=$BINS/RIG001_014.fna
PLV18=$BINS/PLV001_001.fna
PLV20=$BINS/PLV001_002.fna
TAF=$BINS/TAF017_RA.fna
ROOT=$BINS/JAGYWA010000017.fna

mkdir -p "$OUT"

############################
# PHYLOPHLAN-Chlorobium limicola
############################

mkdir -p "$OUT/TREE_drep_climicola"

eval "$(conda shell.bash hook)"
conda activate phylophlan

cd "$OUT/TREE_drep_climicola"

# Make config file: alignment based on translated sequences, using MAFFT and DIAMOND. If nucleotide alignments are needed, use --force_nucleotides here and in the phylophlan command.
phylophlan_write_config_file \
    -d a \
    -o  $OUT/TREE_drep_climicola/config.cfg \
    --db_aa diamond \
    --map_dna diamond \
    --map_aa diamond \
    --msa mafft \
    --trim trimal \
    --tree1 fasttree \
    --tree2 raxml
    
# Set up the database (either the default PhyloPhlAn database or a species-specific one for Chlorobium limicola).
# The Chlorobium limicola database was used here because otherwise the query sequences are removed for not matching the HMM models built to retain sequences.
mkdir -p "$OUT/TREE_drep_climicola/core_genes/"

phylophlan_setup_database \
    -g s__Chlorobium_limicola --database_update \
    -o $OUT/TREE_drep_climicola/core_genes/ \
    --verbose
    
# If the phylophlan run crashes, add --clean_all to the command below
phylophlan \
-i $DREPGENOME \
-d $OUT/TREE_drep_climicola/core_genes/s__Chlorobium_limicola \
-o $OUT/TREE_drep_climicola/ \
--diversity medium \
--accurate \
-f $OUT/TREE_drep_climicola/config.cfg \
--nproc 28 --min_num_markers 1 \
--verbose > $OUT/TREE_drep_climicola/phylophlan_output.log

############################
# Phylophlan alternative with bootstrapping
############################

cd $OUT/TREE_drep_climicola/
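# -f a: rapid bootstrapping plus best-scoring ML tree search; -x/-p: random seeds; -# 100: bootstrap replicates; PROTCATLG: protein CAT model with the LG matrix
raxmlHPC-PTHREADS-SSE3 -m PROTCATLG -f a -x 12345 -p 12345 -# 100 -T 28 -w $OUT/TREE_drep_climicola/ -s $OUT/TREE_drep_climicola/*.aln -n climicola_refined_bootstrap.tre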

conda deactivate
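
The RAxML call passes `-s $OUT/TREE_drep_climicola/*.aln`, which only works if the glob expands to exactly one alignment file. A guard along the following lines could make that assumption explicit. `single_alignment` is a hypothetical helper name, and the sketch is one possible way to do it rather than part of PhyloPhlAn or RAxML.

```shell
# Hypothetical helper: print the path of the only .aln file in a directory.
# Fails (return 1) when the directory holds zero or more than one .aln file.
single_alignment() {
    local dir=$1 f found="" count=0
    for f in "$dir"/*.aln; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
        found=$f
        count=$((count + 1))
    done
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one .aln file in $dir, found $count" >&2
        return 1
    fi
    printf '%s\n' "$found"
}

# Example use before the RAxML call (commented out):
# ALN=$(single_alignment "$OUT/TREE_drep_climicola") || exit 1
# raxmlHPC-PTHREADS-SSE3 ... -s "$ALN" ...
```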

##########

**Step4**: Construct the phylogenetic tree: Chlorobaculum parvum (amino acid level)

##########

*NOTE*: Manually download all Chlorobaculum parvum core proteins from UniRef90 (https://www.uniprot.org/uniref/?query=taxonomy%3A%22Chlorobaculum+parvum+%5B274539%5D%22+identity%3A0.9&sort=score) and add them to the directory set in CORECPARVUM.
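
After the manual download, a quick sanity check on the folder can confirm that protein files with the expected extension are present before `phylophlan_setup_database` is run with `-e faa`. The sketch below counts FASTA records in a directory. `count_fasta_records` is a hypothetical helper, and the choice of what counts as "enough" proteins is left to the user.

```shell
# Hypothetical helper: count FASTA header lines ('>') over all files with
# the given extension in a directory. Prints 0 when no file matches.
count_fasta_records() {
    local dir=$1 ext=$2 f n total=0
    for f in "$dir"/*."$ext"; do
        [ -e "$f" ] || continue            # nothing matched the pattern
        n=$(grep -c '^>' "$f" || true)     # grep exits 1 on zero hits but still prints 0
        total=$((total + n))
    done
    echo "$total"
}

# Example use with the directory from the script below (commented out):
# count_fasta_records "$CORECPARVUM" faa
```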


In [None]:
#!/bin/bash

############################
#Hashes and Directories
############################

# NOTE: Change directories in bash script accordingly 
# NOTE: Add the ancient Bins/MAGs in $BINS

# Directories: 
OUT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output
BINS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/BINS

CHLOROBIALES=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales
DREPGENOME=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales/drep/dereplicated_genomes
CORECPARVUM=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/Uniref90/chlorobaculum_parvum

# Ancient samples:
EMN=$BINS/EMN001_021.fna
GOY6=$BINS/GOY006_RA.fna
GOY5=$BINS/GOY005_001.fna
PES=$BINS/PES001_018.fna
RIG=$BINS/RIG001_014.fna
PLV18=$BINS/PLV001_001.fna
PLV20=$BINS/PLV001_002.fna
TAF=$BINS/TAF017_RA.fna
ROOT=$BINS/JAGYWA010000017.fna

mkdir -p "$OUT"

############################
# PHYLOPHLAN-Chlorobaculum parvum
############################

mkdir -p "$OUT/TREE_drep_cparvum"

eval "$(conda shell.bash hook)"
conda activate phylophlan

cd "$OUT/TREE_drep_cparvum"

# Make config file: alignment based on translated sequences, using MAFFT and DIAMOND. If nucleotide alignments are needed, use --force_nucleotides here and in the phylophlan command.
phylophlan_write_config_file \
    -d a \
    -o  $OUT/TREE_drep_cparvum/config.cfg \
    --db_aa diamond \
    --map_dna diamond \
    --map_aa diamond \
    --msa mafft \
    --trim trimal \
    --tree1 fasttree \
    --tree2 raxml
    
# Set up the custom database
# Manually download all Chlorobaculum parvum core proteins from UniRef90: https://www.uniprot.org/uniref/?query=taxonomy%3A%22Chlorobaculum+parvum+%5B274539%5D%22+identity%3A0.9&sort=score
# Add them to: /Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/Uniref90/chlorobaculum_parvum

mkdir -p "$OUT/TREE_drep_cparvum/core_genes/"
phylophlan_setup_database \
    -i $CORECPARVUM/ \
    -d Uniref90_chlorobaculum_parvum \
    -e faa \
    -t a --overwrite -o $OUT/TREE_drep_cparvum/core_genes --verbose
    
# If the phylophlan run crashes, add --clean_all and also --databases_folder $OUT/core_genes/ to the command below
phylophlan \
-i $DREPGENOME \
-d $OUT/TREE_drep_cparvum/core_genes/Uniref90_chlorobaculum_parvum \
-o $OUT/TREE_drep_cparvum/ \
--diversity medium \
--accurate \
-f $OUT/TREE_drep_cparvum/config.cfg \
--nproc 30 --min_num_markers 1 \
--verbose > $OUT/TREE_drep_cparvum/phylophlan_output.log

############################
# Phylophlan alternative with bootstrapping
############################
cd $OUT/TREE_drep_cparvum/
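# -f a: rapid bootstrapping plus best-scoring ML tree search; -x/-p: random seeds; -# 100: bootstrap replicates; PROTCATLG: protein CAT model with the LG matrix
raxmlHPC-PTHREADS-SSE3 -m PROTCATLG -f a -x 12345 -p 12345 -# 100 -T 28 -w $OUT/TREE_drep_cparvum/ -s $OUT/TREE_drep_cparvum/*.aln -n cparvum_refined_bootstrap.tre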

conda deactivate

##########

**Step5**: Construct the phylogenetic tree: Chlorobium limicola (nucleotide level)

##########

In [None]:
#!/bin/bash

############################
#Hashes and Directories
############################

# NOTE: Change directories in bash script accordingly 
# NOTE: Add the ancient Bins/MAGs in $BINS

# Directories: 
OUT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output
BINS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/BINS

CHLOROBIALES=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales
DREPGENOME=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales/drep/dereplicated_genomes

# Ancient samples:
EMN=$BINS/EMN001_021.fna
GOY6=$BINS/GOY006_RA.fna
GOY5=$BINS/GOY005_001.fna
PES=$BINS/PES001_018.fna
RIG=$BINS/RIG001_014.fna
PLV18=$BINS/PLV001_001.fna
PLV20=$BINS/PLV001_002.fna
TAF=$BINS/TAF017_RA.fna
ROOT=$BINS/JAGYWA010000017.fna

mkdir -p "$OUT"

############################
# PHYLOPHLAN-Chlorobium limicola-drep
############################
mkdir -p "$OUT/TREE_drep_climicola-nt"

eval "$(conda shell.bash hook)"
conda activate phyloplan_nucleotide

cd "$OUT/TREE_drep_climicola-nt"

# Make config file: alignment based on nucleotide sequences (-d n), using MAFFT and BLASTn instead of DIAMOND.
phylophlan_write_config_file \
    -d n \
    -o $OUT/TREE_drep_climicola-nt/config.cfg \
    --db_dna makeblastdb \
    --map_dna blastn \
    --msa mafft \
    --trim trimal \
    --tree1 fasttree \
    --tree2 raxml    
    
# Set up the database (either the default PhyloPhlAn database or a species-specific one for Chlorobium limicola).
# The Chlorobium limicola database was used here because otherwise the query sequences are removed for not matching the HMM models built to retain sequences.
mkdir -p "$OUT/TREE_drep_climicola-nt/core_genes/"

phylophlan_setup_database \
    -g s__Chlorobium_limicola -t n --database_update \
    -o $OUT/TREE_drep_climicola-nt/core_genes/ \
    --verbose
    
# If the phylophlan run crashes, add --clean_all to the command below
phylophlan \
-i $DREPGENOME \
-d $OUT/TREE_drep_climicola-nt/core_genes/s__Chlorobium_limicola \
-o $OUT/TREE_drep_climicola-nt/ \
--diversity medium \
--accurate \
-f $OUT/TREE_drep_climicola-nt/config.cfg \
--nproc 28 --min_num_markers 1 \
--verbose > $OUT/TREE_drep_climicola-nt/phylophlan_output.log

# The root was not incorporated into the tree because it did not share any core genes with the Chlorobiales genomes.

############################
# Phylophlan alternative with bootstrapping
############################
cd $OUT/TREE_drep_climicola-nt/
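# -f a: rapid bootstrapping plus best-scoring ML tree search; -x/-p: random seeds; -# 100: bootstrap replicates; GTRCAT: GTR nucleotide model with the CAT rate approximation
raxmlHPC-PTHREADS-SSE3 -m GTRCAT -f a -x 12345 -p 12345 -# 100 -T 28 -w $OUT/TREE_drep_climicola-nt/ -s $OUT/TREE_drep_climicola-nt/*.aln -n climicola_nt_refined_bootstrap.tre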

conda deactivate
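
The comment in the cell above notes that the root did not end up in the nucleotide tree. Whether a given genome made it into a tree can be checked by looking for its label in the Newick file. The sketch below assumes that tip labels equal the input file names without their extension, which should be verified against the actual output. `tree_has_taxon` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if a Newick tree contains a label that is
# exactly equal to the given taxon name.
tree_has_taxon() {
    local tree=$1 taxon=$2
    # Split the Newick string on its structural characters so that every
    # label and branch length ends up on its own line, then match whole lines.
    tr -s '(),:;' '\n\n\n\n\n' < "$tree" | grep -qxF -- "$taxon"
}

# Example use on the RAxML output of this step (commented out; RAxML prefixes
# the run name given with -n, e.g. RAxML_bipartitions.<run name>):
# tree_has_taxon "$OUT/TREE_drep_climicola-nt/RAxML_bipartitions.climicola_nt_refined_bootstrap.tre" JAGYWA010000017 \
#     || echo "root genome is not in the tree"
```

Matching whole labels instead of substrings avoids a false hit when one genome name is a prefix of another.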