## **Project**: Natural products from the Palaeolithic

## **Section**: Phylogenetic analysis of the Butyrolactone BGCs


Anan Ibrahim, 01.05.2022

**Contents**
 - **Step1**: Create conda environment with required dependencies if not already installed
 - **Step2**: Phylogenetic network analysis - BiGSCAPE
 - **Step3**: BGC genetic topology - CORASON
 - **Step4**: AFSA family phylogenetic tree
 - **Step5**: Alignment of the core genes of the Butyrolactone BGC
 - **Step6**: Butyrolactone cluster synteny - CLINKER

##########

**Step1**: Create conda environment with required dependencies if not already installed

##########

In [None]:
# All conda envs can be found in EMN001_Paleofuran/02-scripts/ENVS_*.yml
conda env create -f chlorobiales_genomes1.yml
conda env create -f phylophlan.yml
conda env create -f taxonomic-binning.yml
conda env create -f clinker.yml
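
The step title says the environments are created only if they are not already installed, but `conda env create` stops with an error when the environment exists. A minimal sketch of such a guard is given here. `env_listed` is a hypothetical helper (not part of conda), and the usage line assumes that each environment is named after its yml file, which this notebook does not confirm.

```shell
# Hypothetical helper: reads `conda env list`-style text on stdin and
# succeeds if an environment named $1 appears in the first column.
env_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage sketch (assumption: the env name equals the yml basename):
# conda env list | env_listed clinker || conda env create -f clinker.yml
```

The helper only inspects text, so it can be tried on a saved listing without touching the conda installation.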

##########

**Step2**: Phylogenetic network analysis - BiGSCAPE

##########

In [None]:
#!/bin/bash

############################
#Hashes and Directories
############################

# NOTE: Change directories in bash script accordingly 
# NOTE: Add the ancient Bins/MAGs in $BINS

# Directories: 
OUT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output
BINS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/BINS

DREPGENOME=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales/drep/dereplicated_genomes
ANCIENT_CONTIGS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/ancient_contigs_names_butyrolactone

mkdir -p "$OUT"

############################
# Bigscape annotation
############################
mkdir -p "$OUT"/BIGSCAPE
mkdir -p "$OUT"/BIGSCAPE/gbk_regions
mkdir -p "$OUT"/BIGSCAPE/genome_gbks

cd $OUT/BIGSCAPE
git clone https://git.wur.nl/medema-group/BiG-SCAPE.git
cd $OUT/BIGSCAPE/BiG-SCAPE

# Copy antiSMASH regions 001-005 of the ancient MAGs
for S in "$OUT"/ANTISMASH/* ; do
cp "$S"/*.region00[1-5].gbk "$OUT"/BIGSCAPE/gbk_regions;
done

# Copy antiSMASH regions 001-007 of the dereplicated reference genomes
for S in "$OUT"/ANTISMASH-drep-refgenomes/* ; do
cp "$S"/*.region00[1-7].gbk "$OUT"/BIGSCAPE/gbk_regions;
done

cp $OUT/PROKKA/*/*.gbk $OUT/BIGSCAPE/genome_gbks
cp $OUT/PROKKA-drep-refgenomes/*/*.gbk $OUT/BIGSCAPE/genome_gbks

mkdir $OUT/BIGSCAPE/BUTYR-1
python bigscape.py --label chlorobiales-drep-butyrolactone --inputdir $OUT/BIGSCAPE/gbk_regions \
--outputdir $OUT/BIGSCAPE/BUTYR-1 \
--pfam_dir /Net/Groups/ccdata/databases/pfam34/ \
--force_hmmscan --mode auto --cutoffs 1.0 --clan_cutoff 1.0 1.0 --include_singletons \
--query_bgc $OUT/ANTISMASH/EMN001_021/c00006_EMN001_...region001.gbk \
--cores 29 --verbose > $OUT/BIGSCAPE/BUTYR-1/run.log
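
The copy step in this cell only picks up a fixed range of antiSMASH region numbers per genome. As a hedged alternative, the sketch below copies every `*.regionNNN.gbk` file it finds and silently skips sample folders without regions. `collect_regions` is a hypothetical helper name, and copying all regions (rather than the first five or seven) is an assumption that may not match the original intent.

```shell
# Hypothetical helper: copy every antiSMASH region GenBank file found in the
# sample subfolders of $1 into the folder $2.
collect_regions() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    for sample in "$src"/*/; do
        for gbk in "$sample"*.region[0-9][0-9][0-9].gbk; do
            # An unmatched glob stays literal, so skip anything that is not a real file
            [ -e "$gbk" ] || continue
            cp "$gbk" "$dest"/
        done
    done
}

# Usage sketch:
# collect_regions "$OUT"/ANTISMASH "$OUT"/BIGSCAPE/gbk_regions
```

Files with identical names in different sample folders would overwrite each other in the destination, exactly as with the plain `cp` commands.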


##########

**Step3**: BGC genetic topology - CORASON

##########

*Note*: Requires Docker installation

In [None]:
#1 mkdir ~/bin
#2 pull the docker container: docker pull nselem/corason:latest
#3 change the windows path in the script: /home/aibrahim/bin
#4 download the script: curl -q https://raw.githubusercontent.com/nselem/corason/master/run_corason > ~/bin/run_corason
#5 make the script executable: chmod a+x ~/bin/run_corason
#6 check if it is running: ~/bin/run_corason

# <EMN001_AFSA.faa> is the amino acid sequence of the AFSA (the query)
# <genome_gbks> holds the gbk files of the modern reference genomes and the ancient MAGs generated by PROKKA;
# <genome_gbks/EMN001_021.gbk> is the gbk file of the ancient MAG EMN001_021
sudo ./run_corason EMN001_AFSA.faa genome_gbks genome_gbks/EMN001_021.gbk -g 10

##########

**Step4**: AFSA family phylogenetic tree

##########

*Note*: Manually download all AFSA family proteins (same domain architecture) from InterPro http://www.ebi.ac.uk/interpro/entry/InterPro/IPR005509/protein/UniProt/#table and add them to (AFSAFAMILY)
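
Because the download is manual, it may help to confirm how many sequences actually ended up in the input folder before clustering (the script comments below mention 742). A minimal sketch, where `count_fasta_records` is a hypothetical helper name:

```shell
# Hypothetical helper: print the total number of FASTA records (header lines
# starting with ">") across all files given as arguments.
count_fasta_records() {
    cat "$@" | grep -c '^>'
}

# Usage sketch:
# count_fasta_records "$AFSAFAMILY"/*.faa
```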

In [None]:
#!/bin/bash

############################
#Hashes and Directories
############################

# NOTE: Change directories in bash script accordingly

# Directories: 
OUT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output
BINS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/BINS
SCRIPT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Scripts

AFSAFAMILY=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/Interpro_A0A101JTJ5
ROOT_AFSA=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/AFSA_outgroup
ANCIENT_AFSA=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/AFSA_aa

# Samples:
EMN=$BINS/EMN001_021.fna
GOY6=$BINS/GOY006_RA.fna
GOY5=$BINS/GOY005_001.fna
PES=$BINS/PES001_018.fna
RIG=$BINS/RIG001_014.fna
PLV18=$BINS/PLV001_001.fna
PLV20=$BINS/PLV001_002.fna
TAF=$BINS/TAF017_RA.fna
ROOT=$BINS/JAGYWA010000017.fna

############################
#Cluster the AFSA protein family
############################
# Cluster the 742 AFSA sequences at 50% minimum sequence identity (--min-seq-id 0.5) for an accurate phylogenetic tree
mkdir -p "$OUT"/AFSA_family_cluster
mkdir -p "$OUT"/AFSA_family_cluster/mmseqs/
mkdir -p "$OUT"/AFSA_family_cluster/mmseqs/tmp

cd $OUT/AFSA_family_cluster/mmseqs

# Replace space with underscores
sed 's/ /_/g' -i $AFSAFAMILY/*.faa

# MMSeqs2 = Cluster the 742 protein sequences
eval "$(conda shell.bash hook)"
conda activate taxonomic-binning

# create db
mmseqs createdb $AFSAFAMILY/*.faa afsa_database

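# cluster the db (cluster mode 0, 50% minimum sequence identity); -s 7 (sensitivity) is not passed here, so the MMseqs2 default applies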
mmseqs cluster afsa_database afsa_database_clu tmp --cluster-mode 0 --min-seq-id 0.5 --threads 30

# create a table version
mmseqs createtsv afsa_database afsa_database afsa_database_clu afsa_database_clu.tsv

# create a fasta version
mmseqs createseqfiledb afsa_database afsa_database_clu afsa_database_clu_seq 
mmseqs result2flat afsa_database afsa_database afsa_database_clu_seq afsa_database_clu_seq.fasta

# extract the representative of a clustering 
mmseqs createsubdb afsa_database_clu afsa_database afsa_database_clu_rep
mmseqs convert2fasta afsa_database_clu_rep afsa_database_clu_rep.fasta   

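# Report the number of clusters (unique representatives in column 1) for cluster mode 0, min-seq-id 0.5
echo "mod0-0.5: $(cut -f1 afsa_database_clu.tsv | sort -u | wc -l) clusters"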

conda deactivate

############################
# MUSCLE and Clustal Omega alignment
############################
mkdir $OUT/AFSA_family_cluster/muscle

eval "$(conda shell.bash hook)"
conda activate phylophlan

# Add ancient AFSA genes and the cluster representatives in a file
cat $ANCIENT_AFSA/*.faa $ROOT_AFSA/*.faa $OUT/AFSA_family_cluster/mmseqs/afsa_database_clu_rep.fasta > $OUT/AFSA_family_cluster/muscle/cluster.fasta

muscle -in $OUT/AFSA_family_cluster/muscle/cluster.fasta \
-phyiout $OUT/AFSA_family_cluster/muscle/alignment.phy \
-clwstrictout $OUT/AFSA_family_cluster/muscle/alignment.clw \
-out $OUT/AFSA_family_cluster/muscle/alignment.afa -maxiters 5

muscle -in $OUT/AFSA_family_cluster/muscle/cluster.fasta \
-out $OUT/AFSA_family_cluster/muscle/alignment.afa -maxiters 5

clustalo -i $OUT/AFSA_family_cluster/muscle/alignment.clw \
--percent-id --distmat-out=$OUT/AFSA_family_cluster/muscle/pim.txt --full --force

conda deactivate 

############################
# RAXML tree construction
############################
mkdir -p "$OUT"/AFSA_family_cluster/tree

eval "$(conda shell.bash hook)"
conda activate phylophlan

cp $OUT/AFSA_family_cluster/muscle/alignment.phy $OUT/AFSA_family_cluster/tree/
cd $OUT/AFSA_family_cluster/tree

raxmlHPC-PTHREADS-AVX2 -f a -x 12345 -p 12345 -# 50 -m PROTGAMMAAUTO -T 29 -s alignment.phy -n alignment.tree 

conda deactivate
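
The clustering in this step writes `afsa_database_clu.tsv`, a two-column table produced by `mmseqs createtsv` in which each line pairs a cluster representative with one member. A short sketch for summarising that table is given here; `count_clusters` and `cluster_sizes` are hypothetical helper names, and sequence identifiers are assumed to contain no whitespace.

```shell
# Hypothetical helper: number of clusters = number of distinct representatives
# in column 1 of the representative<TAB>member table.
count_clusters() {
    cut -f1 "$1" | sort -u | wc -l | tr -d ' '
}

# Hypothetical helper: print representative<TAB>cluster size, largest first.
cluster_sizes() {
    cut -f1 "$1" | sort | uniq -c | sort -rn | awk '{ print $2 "\t" $1 }'
}

# Usage sketch:
# count_clusters "$OUT"/AFSA_family_cluster/mmseqs/afsa_database_clu.tsv
# cluster_sizes  "$OUT"/AFSA_family_cluster/mmseqs/afsa_database_clu.tsv | head
```

The cluster count should equal the number of representative sequences in `afsa_database_clu_rep.fasta`, which is what goes into the alignment.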


##########

**Step5**: Alignment of the core genes of the Butyrolactone BGC

##########

In [None]:
#!/bin/bash

############################
#Hashes and Directories
############################

# NOTE: Change directories in bash script accordingly 

# Directories: 
OUT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output
BINS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/BINS
SCRIPT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Scripts

DREPGENOME=/Net/Groups/ccdata/databases/ncbi-ref-genomes/Chlorobiales/drep/dereplicated_genomes
AFSAFAMILY=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/Interpro_A0A101JTJ5
ROOT_AFSA=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/AFSA_outgroup/*.faa
ANCIENT_AFSA_comb=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/AFSA_aa.faa
EMN_AFSA=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/AFSA_aa/EMN001_AFSA.faa
ANCIENT_CONTIGS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/ancient_contigs_names_butyrolactone
BGC_GENES=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Input/BGC_genes

############################
# Grab percent identities of AFSA/OXI/SKI genes in Butyrolactone and terpene core protein
############################

mkdir -p "$OUT"/BGC_GENES
cd "$OUT"/BGC_GENES

# For each BGC gene folder, concatenate all nucleotide (.fna) and amino acid (.faa) sequences into one file each
for F in "$BGC_GENES"/* ; do
N=$(basename "$F") ;
cat "$F"/*.fasta.fna > all_"$N".nt.fasta ;
cat "$F"/*.fasta.faa > all_"$N".aa.fasta ;
done

eval "$(conda shell.bash hook)"
conda activate /home/aibrahim/anaconda3/envs/phylogenetics

for F in $OUT/BGC_GENES/*.fasta ; do
N=$(basename $F .fasta ) ;
muscle -in $F \
-phyiout $OUT/BGC_GENES/$N.alignment.phy \
-clwstrictout $OUT/BGC_GENES/$N.alignment.clw \
-out $OUT/BGC_GENES/$N.alignment.afa -maxiters 10 ;
muscle -in $F \
-out $OUT/BGC_GENES/$N.alignment.afa -maxiters 10 ;
clustalo -i $OUT/BGC_GENES/$N.alignment.clw --percent-id --distmat-out=$OUT/BGC_GENES/$N.pim.txt --full --force ;
# Drop the first line of the matrix file so that only the per-sequence rows remain
sed -i '1d' $OUT/BGC_GENES/$N.pim.txt;
done

conda deactivate 
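
Each `*.pim.txt` file written in this step is a square percent-identity matrix. After the first line has been removed, every row is expected to start with a sequence name followed by one value per sequence, in the same order as the rows. Under that assumption about the layout, the sketch below looks up the identity between two named sequences; `pairwise_identity` is a hypothetical helper name, and sequence names are assumed to contain no whitespace.

```shell
# Hypothetical helper: print the percent identity between sequences $2 and $3
# from matrix file $1 (row = name followed by one value per sequence).
pairwise_identity() {
    awk -v a="$2" -v b="$3" '
        { name[NR] = $1 }                      # remember the row order
        $1 == a { seen = 1; for (i = 2; i <= NF; i++) row[i - 1] = $i }
        END {
            if (!seen) exit 1                  # first sequence not found
            for (j = 1; j <= NR; j++)
                if (name[j] == b) { print row[j]; exit 0 }
            exit 1                             # second sequence not found
        }' "$1"
}

# Usage sketch (file and sequence names are placeholders):
# pairwise_identity "$OUT"/BGC_GENES/<gene>.pim.txt <seq1> <seq2>
```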

##########

**Step6**: Butyrolactone cluster synteny - CLINKER

##########

In [None]:
#!/bin/bash

############################
#Hashes and Directories
############################

# NOTE: Change directories in bash script accordingly 

# Directories: 
OUT=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output

BUT_CONTIGS=/Net/Groups/ccdata/users/AIbrahim/ancientDNA/Deep-Evo/BGC/final-butyrolactone/Output/CLINKER/butyrolactone_contigs

mkdir -p "$OUT"

############################
# Run CLINKER on the `.gbk` files of the Butyrolactone BGC-containing contigs in $BUT_CONTIGS
############################
mkdir -p "$OUT"/CLINKER/butyrolactone
cd $OUT/CLINKER/butyrolactone

eval "$(conda shell.bash hook)"
conda activate clinker 

# CLINKER
clinker $BUT_CONTIGS/*.gbk -p butyrolactone.html -j 30 -o butyrolactone.aln

conda deactivate

# Manual step: adjust the synteny image in the interactive .html output