# Aim
Repeat the earlier analyses from step4, but:
* without the ARP clade
* without the spurious CDC5 sequence

+ Add Arabidopsis MYBs to annotate their subfamilies as in J&R
+ Annotate the tree with RNA-seq data for Azolla filiculoides

All sequences are linear already, so I can start composing the fasta file and aligning right away.
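
The assumption that every input is already linear can be checked before concatenating. This is a minimal sketch, taking "linear" to mean that each header is followed by exactly one sequence line; `is_linear_fasta` is a hypothetical helper and not part of the original workflow.

```shell
# Hypothetical helper: succeeds only if every header in a FASTA file is
# followed by exactly one non-empty sequence line.
is_linear_fasta() {
    awk '
        /^>/ { if (seen && n != 1) bad = 1; seen = 1; n = 0; next }
        NF   { if (!seen) bad = 1; n++ }
        END  { if (!seen || n != 1) bad = 1; exit bad + 0 }
    ' "$1"
}

# Report any input file that does not pass the check.
for f in data/*.fasta
do  [ -f "$f" ] || continue          # skip if the glob matched nothing
    is_linear_fasta "$f" || echo "not linear: $f"
done
```

No output means all files passed; a file that is reported would need to be linearised before the `cat` step.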

In [1]:
cat data/CDC5-outgroup_sequences_linear.fasta \
    data/I_sequences_linear.fasta    \
    data/II_sequences_linear.fasta   \
    data/III_sequences_linear.fasta  \
    data/IV_sequences_linear.fasta   \
    data/V_sequences_linear.fasta    \
    data/VI_sequences_linear.fasta   \
    data/VII_sequences_linear.fasta  \
    data/VIII_sequences_linear.fasta \
    data/Azfi-mybs-subfamVI-suspects_linear.fasta \
    data/R1R2R3_sequences_linear.fasta \
    data/arabidopsis-myb_sequences.fasta \
    > data/combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear.fasta
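
# Optional sanity check on the combined file (a sketch, not part of the original run):
# count the sequences and list headers that occur more than once, since the file
# pools several sources. `duplicate_ids` is a hypothetical helper, and treating the
# first space-delimited word of a header as the sequence ID is an assumption.

```shell
# Hypothetical helper: print sequence IDs (first word of the header) that
# appear more than once in a FASTA file.
duplicate_ids() {
    grep '^>' "$1" | cut -d ' ' -f 1 | sort | uniq -d
}

combined=data/combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear.fasta
if  [ -f "$combined" ]
then  echo "sequences: $(grep -c '^>' "$combined")"
      duplicate_ids "$combined"
fi
```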
    

In [2]:
inseq=combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear

In [3]:
conda activate phylogenetics
mkdir -p ./data/alignments_raw
# $inseq was set in the previous cell, so no loop over input files is needed
if    [ ! -f "./data/alignments_raw/${inseq}_aligned-mafft-einsi.fasta" ]
then  echo "aligning $inseq"
      einsi --thread "$(nproc)" "data/$inseq.fasta" >  "./data/alignments_raw/${inseq}_aligned-mafft-einsi.fasta" \
                                                    2> "./data/alignments_raw/${inseq}_aligned-mafft-einsi.log"
fi
conda deactivate

aligning combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear

In [4]:
conda activate jalview
for   i in data/alignments_raw/*.fasta
do    prefix=${i%.fasta}
      if    [ ! -f "$prefix.png" ]
      then  jalview -nodisplay \
                    -open $prefix.fasta \
                    -colour CLUSTAL \
                    -png  $prefix.png > /dev/null 2> /dev/null
      fi
done
conda deactivate


The einsi alignment looks like this:
![](data/alignments_raw/combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear_aligned-mafft-einsi.png)

That looks quite a bit sparser than the einsi alignments I have seen before, likely as a consequence of adding all these recent Arabidopsis sequences.
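
To put a number on "sparse", the overall gap fraction of an alignment can be computed and compared between runs. This is a rough sketch only: `gap_fraction` is a hypothetical helper, and it counts `-` characters over all alignment positions without distinguishing terminal from internal gaps.

```shell
# Hypothetical helper: print the fraction of gap characters ("-") among all
# positions of an aligned FASTA file, to three decimals.
gap_fraction() {
    LC_ALL=C awk '
        /^>/ { next }                                   # skip headers
        { total += length($0); gaps += gsub(/-/, "-") } # gsub returns the number of gaps
        END { if (total) printf "%.3f\n", gaps / total; else print "NA" }
    ' "$1"
}

# Fall back to the name used in this notebook if $inseq is not set.
inseq=${inseq:-combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear}
aln="data/alignments_raw/${inseq}_aligned-mafft-einsi.fasta"
if  [ -f "$aln" ]
then  gap_fraction "$aln"
fi
```

Running the same helper on an earlier einsi alignment would show whether the added sequences really increased the gap content.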

In [15]:
conda activate phylogenetics
mkdir -p data/alignments_trimmed

# define appendix only once here:
trimappendix='trim-gt5'

for a in "data/alignments_raw/$inseq"_aligned*.fasta
do  # strip the directory, the "$inseq"_ prefix and the .fasta extension
    appendix=$(basename "$a" .fasta | sed "s/^${inseq}_//")
    if    [ ! -f "data/alignments_trimmed/${inseq}_${appendix}_${trimappendix}.fasta" ]
    then  echo "trimming alignment $a"
          # replace spaces in headers with underscores (edits the raw alignment in place)
          sed -i 's/ /_/g' "$a"
          trimal -in $a   \
                 -out data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".fasta \
                 -gt .5 \
                 -htmlout data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".html &
    fi
done
wait
conda deactivate

trimming alignment data/alignments_raw/combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear_aligned-mafft-einsi.fasta
[1] 935229
[1]+  Done                    trimal -in $a -out data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".fasta -gt .5 -htmlout data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".html

In [16]:
conda activate jalview
for   i in data/alignments_trimmed/*.fasta
do    prefix=${i%.fasta}
      if    [ ! -f "$prefix.png" ]
      then  jalview -nodisplay \
                    -open $prefix.fasta \
                    -colour CLUSTAL \
                    -png  $prefix.png > /dev/null 2> /dev/null &
      fi
done
wait
conda deactivate

[1] 935352
[1]+  Done                    jalview -nodisplay -open $prefix.fasta -colour CLUSTAL -png $prefix.png > /dev/null 2> /dev/null

### 60%
![einsi, trim-gt6](data/alignments_trimmed/combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear_aligned-mafft-einsi_trim-gt6.png)

### 50%
![einsi, trim-gt5](data/alignments_trimmed/combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear_aligned-mafft-einsi_trim-gt5.png)

### 40%
![einsi, trim-gt4](data/alignments_trimmed/combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear_aligned-mafft-einsi_trim-gt4.png)
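
The trimming cell above only writes the `trim-gt5` files. A sketch of how all three thresholds could be produced in one pass follows. It assumes that the `trim-gt4` and `trim-gt6` files came from the same `trimal` call with `-gt .4` and `-gt .6`, which the notebook does not show, and it reuses only the `trimal` options already used above.

```shell
# Fall back to the name used in this notebook if $inseq is not set.
inseq=${inseq:-combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear}
aln="data/alignments_raw/${inseq}_aligned-mafft-einsi.fasta"

# Assumption: the suffix trim-gtN corresponds to trimal's -gt .N
for gt in 4 5 6
do  out="data/alignments_trimmed/${inseq}_aligned-mafft-einsi_trim-gt${gt}"
    # only run when the input exists, the output is missing and trimal is available
    if  [ -f "$aln" ] && [ ! -f "$out.fasta" ] && command -v trimal > /dev/null
    then  echo "trimming $aln with -gt .$gt"
          trimal -in "$aln" -out "$out.fasta" -gt ".$gt" -htmlout "$out.html"
    fi
done
```

Existing outputs are skipped, so rerunning the loop after the `trim-gt5` run only adds the missing thresholds.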

Based on past experience, 