# The Plan

0. Make sure you have all tools installed. mafft, trimal, iqtree at the very least. 
   - use `conda env create -f ./conda_environment.yaml` to create this environment and re-open jupyter inside the environment if you haven't already.
1. Get a set of sequences to build a tree of.
   - for example from the 1kP project, a paper you found, or from blast.
   - Subset sequences if there are too many.
   - Do you have an outgroup to root your tree on? (unless you won't root your tree)
   - Do you have trusted or verified sequences to make sense of the different clades you may get in your tree?
2. Align these to each other.
   - using mafft or another aligner like clustalw, tcoffee or prank.
3. Trim the alignment, removing gaps.
   - with trimAl, optimise trimming to make your tree building both faster and more reliable.
4. Build a fast phylogenetic tree.
   - with `fasttree`, or with `iqtree --fast`
5. Build a thorough phylogenetic tree.
   - We combine substitution model fitting and tree building in IQtree.
6. Visualise and share your tree
   - [iToL](https://itol.embl.de/)
   
## Annotate and log

A Jupyter notebook like this is your lab journal for doing research on the computer. In here you keep a record of 
 - what files you use
 - how you made new files
 - where you stored these
 - etcetera.
 
You do this simply by writing the code and keeping the output saved in here. However, one thing is not recorded automatically: the choices you make. Hence, for transparent and proper science, it is vital that you make this notebook your own by writing down all observations, decisions, etcetera in here.

![image.png](attachment:image.png)


Describe how you got the sequences you're making a tree of, why you got those sequences there, and what question you are trying to answer by making this tree.

Erbil's ANS sequences: two versions of (supposedly) the same orthogroup from the 1kP project, each extracted via a different established sequence.

Second, some _Azolla filiculoides_ sequences we'd like to place in context, and some guide sequences selected by Erbil.

# 1. Composing your fasta

Let's look at what we have

In [2]:
tree

.
├── analyses
│   ├── orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_fasttrees
│   │   └── aligned-mafft_trim-gt4-seq95-res90
│   │       ├── orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90_iqtree-fast.bionj
│   │       ├── orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90_iqtree-fast.ckp.gz
│   │       ├── orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90_iqtree-fast.iqtree
│   │       ├── orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90_iqtree-fast.log
│   │       ├── orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90_iqtree-fast.stderr
│   │       ├── orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90_iqtree-fast.stdout
│   │       ├── orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90_iqtree-fast.svg
│   │       ├── orthogroup_AtLDOX_AT4

Store the sequences you want to make a tree of in the `data` directory and set the `inseq` variable to the name of your input fasta without the extension:

In [3]:
inseq=orthogroup_ANS-LDOX

Let's look at the first ten lines of your fasta to double-check.

In [28]:
head data/$inseq.fasta

>Orysa_v7_0-LOC_Os06g42130_1-Oryza_sativa_VvANS-like
MTDVELRVEALSLSDVSAIPPEYVRLEEERTDLGDALEVARAASDDADAARIPVVDISAFDGDGRRACVEAVRAAAEEWGVMHIAGHGLPGDVLDRLRAAGEAFFALPIAEKEAYANDPAAGRLQGYGSKLAANASGKREWEDYLFHLVHPDHLADHSLWPANPPEYVPVSRDFGGRVRTLASKLLAILSLGLGLPEETLERRLRRHDQHGVDDDLLLQLKINYYPRCPRPDLAVGVEAHTDVSALSFILHNGVPGLQAHHAGTWVTARSEQGTIVVHVGDALEILTNGRYTSVLHRSLVSRDAVRVSWVVFCEPPPESVLLQPLPELLANGAGKPLFAPRTFKQHVQRKLFKKLKDQQDNNAAAASNGIIPK
>NHUA-2070407-Castanea_crenata_VvANS-like
MSPAMVMAAKTKQNDNLESQYQKGVKHLCENGINKVPKKYILPVSDRPNMKDRELINVKKQNLKLPIIDFAELQGSNRPQVLKSLANACEQYGFFQLVNHGIPSDVISSMIDVSTRFFELPFEERAKYMSSDMQSPVRCG
>XKPS-2011846-Ximenia_americana_VvANS-like
REVTEEYNKEILRVTYKLLEVMSEGLGLEEKVLESVLGGKDIEVEMKINMYPPCPQPELALGVEPHTDMSALTILVSNDVPGLQVFKDNNWVSVHYLPDALFVHVGDQIEVLSNGKYKSVLHRSLVNKERMRMSWAVFCTPPHEAMIGPLPKLVDDENSAKYSTKTYAEYRHRKFNKLPQ
>JETM-2084677-Bauhinia_tomentosa_VvANS-like
MAEDNGLGWGKSLPAPNVQEMVRNDPQTVPDRYIRDYKDRPVIYDQSPDSSEIQVIDFSLLFKRDEDEIKKLDFACKEWGFFQLVNHGVAEELIYKMKESEKGFFDLPYEEKKKYA
>EGLZ

Check the nr. of sequences in your fasta file:

In [4]:
grep '>' data/$inseq.fasta -c

92266


In [5]:
grep '>' data/$inseq.fasta | sort | uniq --count | cut -c 1-7 | sort | uniq --count

  92266       1


In [6]:
grep -v '>' data/$inseq.fasta | sort | uniq --count | cut -c 1-7 | sort | uniq --count

  88665       1
   1597       2
    116       3
     12       4
      2       5


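As initially expected, there are duplicated sequences. Reading the first column as the number of distinct sequences and the second as how often each occurs, 88665 sequences are present once and 1727 are present two to five times. Hence, I'll continue with the orthogroup based on _Arabidopsis thaliana_ LDOX.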
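
To make the two-column output above easier to read, here is a minimal sketch on a toy FASTA (the file and its five sequences are made up for illustration). The first `uniq -c` counts how often each sequence line occurs; the second counts how many sequences share each occurrence count. It assumes a linearised FASTA with one sequence line per record.

```shell
# Toy FASTA: the sequence MKV occurs three times, MAA and MTT once each.
tmp=$(mktemp -d)
printf '>a\nMKV\n>b\nMKV\n>c\nMAA\n>d\nMTT\n>e\nMKV\n' > "$tmp/toy.fasta"

# Column 1: number of distinct sequences; column 2: how often each of them occurs.
hist=$(grep -v '>' "$tmp/toy.fasta" \
  | sort | uniq -c \
  | awk '{print $1}' \
  | sort -n | uniq -c \
  | awk '{print $1, $2}')
echo "$hist"
rm -r "$tmp"
```

Here the line `2 1` means that two distinct sequences occur once, and `1 3` means that one sequence occurs three times.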

In [7]:
inseq=orthogroup_AtLDOX_AT4g22880

## 1.2 systematic subsetting.

If your data set has some systematic naming of sequences, or of species/genus names, it would be good to subset based on those. How to do this depends completely on the formatting of your specific dataset. Here I exemplify this for the 1kP dataset.

The 1kP dataset works with 4-letter codes for transcripts assembled from RNA, and with 5-letter codes for transcripts derived from assembled genomes. The former can be browsed [here](http://www.onekp.com/samples/list.php).

The nice thing about the 1kP dataset, is that it has plants categorised in clades, families and species. You can use this data to evenly select species from both early and late branching clades and get a nice overview of plant evolution.

I propose to make a file with the 'search criteria' that your sequences need to match. That could be species names, certain codes, etcetera. Next, use grep to get all sequence names that match those search criteria, and then extract the sequences.

Finally, I'm just making sure that all sequences names are unique.

Now let's get a list of 1kP identifiers used in the LAR tree also made for Erbil:

In [14]:
cd data
wget https://raw.githubusercontent.com/lauralwd/LAR_phylogeny_gungor-et-al-2020/main/data/1kP_LAR_selectionv1_guide_v5.fasta
cd ..

--2021-02-02 17:22:38--  https://raw.githubusercontent.com/lauralwd/LAR_phylogeny_gungor-et-al-2020/main/data/1kP_LAR_selectionv1_guide_v5.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.36.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.36.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 241877 (236K) [text/plain]
Saving to: ‘1kP_LAR_selectionv1_guide_v5.fasta’


2021-02-02 17:22:39 (26,0 MB/s) - ‘1kP_LAR_selectionv1_guide_v5.fasta’ saved [241877/241877]



get the RNA ids:

In [42]:
grep '>' data/1kP_LAR_selectionv1_guide_v5.fasta | grep -e '>[A-Z][A-Z][A-Z][A-Z]-' | cut -c 1-5 | sort | uniq > data/1kP_ids_v1.txt

get the DNA ids:

In [43]:
grep '>' data/1kP_LAR_selectionv1_guide_v5.fasta | grep -e '>[A-Z][a-z][a-z][a-z][a-z]_' | cut -c 1-6 | sort | uniq >> data/1kP_ids_v1.txt
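
As an illustration of the two patterns above, the sketch below applies them to a few header lines. Most are copied from earlier output in this notebook; the second NHUA header and the toy file names are made up. Four capitals followed by `-` are treated as a transcriptome ID, and a capital followed by four lowercase letters and `_` as a genome ID. The leading `>` is kept on purpose so the IDs can later only match at the start of a header. Setting `LC_ALL=C` is my addition, to make sure the bracket ranges are plain ASCII.

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  '>NHUA-2070407-Castanea_crenata_VvANS-like' \
  '>NHUA-0000000-made_up_second_hit' \
  '>XKPS-2011846-Ximenia_americana_VvANS-like' \
  '>Orysa_v7_0-LOC_Os06g42130_1-Oryza_sativa_VvANS-like' \
  '>Sinningia_cardinalis_ANS' > "$tmp/headers.txt"

# Transcriptome IDs: '>' plus four capitals, then '-'
LC_ALL=C grep -e '^>[A-Z][A-Z][A-Z][A-Z]-' "$tmp/headers.txt" | cut -c 1-5 | sort -u  > "$tmp/ids.txt"
# Genome IDs: '>' plus one capital and four lowercase letters, then '_'
LC_ALL=C grep -e '^>[A-Z][a-z][a-z][a-z][a-z]_' "$tmp/headers.txt" | cut -c 1-6 | sort -u >> "$tmp/ids.txt"

ids=$(cat "$tmp/ids.txt")
echo "$ids"
rm -r "$tmp"
```

The guide-style header `>Sinningia_cardinalis_ANS` matches neither pattern, and the two NHUA headers collapse into a single ID.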

Next, I'll use these IDs to extract a subset from the orthogroup. First I have to linearise the fasta, then I can grep the sequences.
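
Linearising means joining wrapped sequence lines so that every record is exactly two lines: a header, then the full sequence. A minimal sketch, applying the same awk one-liner as the next cell to a made-up wrapped FASTA:

```shell
tmp=$(mktemp -d)
# Made-up FASTA with sequences wrapped over several lines
printf '>seq1\nMKV\nLLA\n>seq2\nMTT\nA\n' > "$tmp/wrapped.fasta"

# Headers start a new line; all other lines are appended without a newline
awk '/^>/ {printf("%s%s\n",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' \
  "$tmp/wrapped.fasta" > "$tmp/linear.fasta"

linear=$(cat "$tmp/linear.fasta")
echo "$linear"
rm -r "$tmp"
```

After this step, `grep -A 1` on a header is enough to retrieve the complete sequence.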

In [63]:
cat data/$inseq.fasta \
  | awk '/^>/ {printf("%s%s\n",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' \
  > data/"$inseq"_linear.fasta

Thinking about it, I can remove those appendices from the sequence names again.

In [66]:
sed -i 's/_AtLDOX-like$//'  data/orthogroup_AtLDOX_AT4g22880*
sed -i 's/_VvANS-like$//'  data/orthogroup_VvANS.fasta

In [7]:
head -n 4 data/"$inseq"_linear.fasta

>RHAU-2061976-RHAU-Chamaseyce_mesebyranthemum-2_samples_combined
NQWPLSPPNFREDCEEYAEEMERLAYLLLELIAKSLGLKANRFPDYFKDQTSFIRLNHYPPCPASELALGNGRHKDGGALTILAQDEVGGLQVKRKSDGEWVSVNPILNSYIINVGEIIQV
>NVGZ-2012099-NVGZ-Cephalotaxus_harringtonia-2_samples_combined
VVATLSQACKEWGFFYVVNHGIPQELLQNMESLVSRLFAMPAELKQKAATSNRKESYDITERIENFCFANLPHSDSVQAMCDRIWPEEGNPEFCETIRTYISYVADLQRKISKMFTAGLGLDVDTFYHSDFENYVSHLRINNYYSDGTMSMEEEFPSAHTDIGCFTIVNTGKDEGLQVRSNEGNWVNVKPLPHSFVVNVGDCLKAWTNRRYQSREHRVVYKGWENRISLPYFVNFPADKQIWAPAELVDDNHPRRYRPFTFSQF


In [69]:
grep -f ./data/1kP_ids_v1.txt data/"$inseq"_linear.fasta -A 1 --no-group-separator > data/"$inseq"_selection-v1.fasta

In [70]:
grep '>' -c data/"$inseq"_selection-v1.fasta
wc -l data/"$inseq"_selection-v1.fasta

4253
8506 data/orthogroup_AtLDOX_AT4g22880_selection-v1.fasta
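
The subsetting step can be sketched on toy data as follows (file names and sequences are made up). Because the FASTA is linearised, `-A 1` pulls in exactly the sequence line after each matching header, so the number of lines should be twice the number of headers, as in the counts above. Note that `--no-group-separator` is a GNU grep option.

```shell
tmp=$(mktemp -d)
printf '>NHUA-1-sp_a\nMKV\n>ZZZZ-2-sp_b\nMTT\n>Orysa_v7-sp_c\nMAA\n' > "$tmp/linear.fasta"
printf '>NHUA\n>Orysa\n' > "$tmp/ids.txt"

# Keep each header that matches one of the IDs, plus the line after it
grep -f "$tmp/ids.txt" -A 1 --no-group-separator "$tmp/linear.fasta" > "$tmp/selection.fasta"

nseq=$(grep -c '>' "$tmp/selection.fasta")
nlines=$(wc -l < "$tmp/selection.fasta" | tr -d ' ')
echo "$nseq sequences on $nlines lines"
rm -r "$tmp"
```

If the line count is not exactly twice the sequence count, the input was probably not linearised or a group separator slipped in.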


~~Great, it's still quite a bunch but this we can work with.~~
Oops, no: this is quite a lot, actually 8 times more than I had aimed for. I might just let it go for now, but I'll have to do a second pass to remove a bunch of sequences.

## 1.3 add guide sequences
If you have versions of your sequence that are of special interest, or functionally verified ones, be sure to add them! 

I imagine you have your guide sequences named something like `data/guide_sequences_v1.fasta`. Combine the two files like so, and update the `$inseq` variable with the new name once you are done with this section.

First, also linearise the guide sequences:

In [9]:
cat data/guide-sequences-v2.fasta \
  | awk '/^>/ {printf("%s%s\n",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' \
  > data/guide-sequences-v2_linear.fasta

In [11]:
cat data/ANS-likes_Azolla-filiculoides_v2.fasta \
  | awk '/^>/ {printf("%s%s\n",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' \
  > data/ANS-likes_Azolla-filiculoides_v2_linear.fasta

now combine

In [16]:
# for the selection workflow
cat data/guide-sequences-v2_linear.fasta data/ANS-likes_Azolla-filiculoides_v2_linear.fasta data/"$inseq"_selection-v1.fasta > data/"$inseq"_selection-v1_guide-v2.fasta

In [17]:
head data/"$inseq"_selection-v1_guide-v2.fasta

>Sinningia_cardinalis_ANS
MAPSPRVEILASSGNQSIPMEYVRSKEELKSMCTDIFLEEKSNEGPQVPIIDLKAISAEDEDTRKKCHEELKKAAMEWGVMHLINHGVSEEIISRVKIAGKEFFELPVEEKEKYANDQAAGNLHGYGSKLANNANGMLEWEDYFFHCIFPEEKRDFSIWPENPTDYIPSVSQYGKQLRGLATKILSVLSLGLGLEEDRLENEVGGMEELLLQHKINYYPKCPQPELALGVEAHTDISALTFILHNMVPGLQLFYQGKWVTAKCIPNSIIMHIGDTVEILSNGKYKSILHRGSVNKEKVRISWAVFCEPPMEKIVLKPLPETVSQDKPPLFPPCTFAEHVKHKLFRNGSEDAAVDKKSIGDECKEARG
>AaFNS
MAPTTITALAQEKTLNLAFVRDEDERPKVAYNQFSNEIPIISLAGMDDDTGRRPQICRKIVEAFEDWGIFQVVDHGIDGTLISEMTRLSREFFALPAEEKLRYDTTGGKRGGFTISTHLQGDDVKDWREFVTYFSYPIDDRDYSRWPDKPQGWRSTTEVYSEKLMVLGAKLLEVLSEAMGLEKEALTKACVNMEQKVLINYYPTCPEPDLTLGVRRHTDPGTITILLQDMVGGLQATRDGGKTWITVQPVEGAFVVNLGDHGHYLSNGRFKNADHQAVVNSTSSRLSIATFQNPAQNAIVYPLRIREGEKAVLDEAITYAEMYKKNMTKHIEVATLKKLAKEKRLQEEKAKLETESKSADGISA
>AaH6H
MATLVSNWSTNNVSESFKAPLEKRAEKDVPLGNDVPIIDLQQDHHLVVQQITKACQDFGLFQVINHGFPEKLMAETMKVCKEFFALPAEEKEKLQPKGKPAKFELPLEQKAKLYIEGEQLSNGELFYWKDTLAHGCHPLDEELVNSWPEKPATYREVVSKYSVEVRKLTMRMLDYICEGLGLKLGYFDNELSQIQMMLANYYPPCPDPSSTLGSGAHYDGNVITLLQ

In [19]:
grep '>' -c data/"$inseq"_selection-v1_guide-v*.fasta

data/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1.fasta:4301
data/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2.fasta:4364
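
Combining files is where duplicate names tend to sneak in, so a quick check is to list any header that occurs more than once; an empty result means all names are unique. This is a sketch on made-up files, not the notebook's data.

```shell
tmp=$(mktemp -d)
printf '>guide1\nMKV\n>guide2\nMTT\n' > "$tmp/guides.fasta"
printf '>NHUA-1\nMAA\n>guide2\nMTT\n' > "$tmp/selection.fasta"
cat "$tmp/guides.fasta" "$tmp/selection.fasta" > "$tmp/combined.fasta"

# uniq -d prints only lines that occur more than once in sorted input
dups=$(grep '>' "$tmp/combined.fasta" | sort | uniq -d)
echo "duplicated headers: ${dups:-none}"
rm -r "$tmp"
```

Duplicated names are worth resolving before aligning, since downstream tools may rename or reject them.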


Then reset the variable and check.

In [20]:
# for the selection workflow
inseq="$inseq"_selection-v1_guide-v2
echo $inseq

orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2


In [21]:
head data/$inseq.fasta

>Sinningia_cardinalis_ANS
MAPSPRVEILASSGNQSIPMEYVRSKEELKSMCTDIFLEEKSNEGPQVPIIDLKAISAEDEDTRKKCHEELKKAAMEWGVMHLINHGVSEEIISRVKIAGKEFFELPVEEKEKYANDQAAGNLHGYGSKLANNANGMLEWEDYFFHCIFPEEKRDFSIWPENPTDYIPSVSQYGKQLRGLATKILSVLSLGLGLEEDRLENEVGGMEELLLQHKINYYPKCPQPELALGVEAHTDISALTFILHNMVPGLQLFYQGKWVTAKCIPNSIIMHIGDTVEILSNGKYKSILHRGSVNKEKVRISWAVFCEPPMEKIVLKPLPETVSQDKPPLFPPCTFAEHVKHKLFRNGSEDAAVDKKSIGDECKEARG
>AaFNS
MAPTTITALAQEKTLNLAFVRDEDERPKVAYNQFSNEIPIISLAGMDDDTGRRPQICRKIVEAFEDWGIFQVVDHGIDGTLISEMTRLSREFFALPAEEKLRYDTTGGKRGGFTISTHLQGDDVKDWREFVTYFSYPIDDRDYSRWPDKPQGWRSTTEVYSEKLMVLGAKLLEVLSEAMGLEKEALTKACVNMEQKVLINYYPTCPEPDLTLGVRRHTDPGTITILLQDMVGGLQATRDGGKTWITVQPVEGAFVVNLGDHGHYLSNGRFKNADHQAVVNSTSSRLSIATFQNPAQNAIVYPLRIREGEKAVLDEAITYAEMYKKNMTKHIEVATLKKLAKEKRLQEEKAKLETESKSADGISA
>AaH6H
MATLVSNWSTNNVSESFKAPLEKRAEKDVPLGNDVPIIDLQQDHHLVVQQITKACQDFGLFQVINHGFPEKLMAETMKVCKEFFALPAEEKEKLQPKGKPAKFELPLEQKAKLYIEGEQLSNGELFYWKDTLAHGCHPLDEELVNSWPEKPATYREVVSKYSVEVRKLTMRMLDYICEGLGLKLGYFDNELSQIQMMLANYYPPCPDPSSTLGSGAHYDGNVITLLQ

# 2. Aligning

Now we have our fasta file with a feasible number of sequences. The next step is aligning these. While this may seem trivial, the method of aligning can actually influence your end results quite a bit. Roughly speaking, there are several alignment algorithms:

1. progressive
   - mafft
   - clustal?
2. pair-wise
   - mafft
   - ...
3. ...

Especially for bigger datasets, I prefer mafft because it is simply very fast and has given me good results in the past. But by all means try other aligners if you get odd results.

If you have a good idea of what you're doing, and you want to run multiple alignments in a loop and go have lunch, have a look at section 2.2.

## 2.1 running alignments (one by one).

### 2.1.1. MAFFT [online](https://mafft.cbrc.jp/alignment/server/)

If you find mafft options and parameters confusing, and/or you have difficulty making alignments, 
then you may try the online service [here](https://mafft.cbrc.jp/alignment/server/). 
The online MAFFT service does a good job of explaining the parameters and has a nice visualisation as well!
So read the webpage, and choose your options and parameters aided by the explanations there. When you submit your job, the mafft command issued in the background is actually shown to you! Hence you can copy-paste that command here if you'd like. That's especially useful when the server is under high load: in this notebook you may choose to use all threads available on your computer with `--thread $(nproc)`.

Using the MSAviewer that you can open after running your alignment on this server, you can even trim interactively. From a reproducibility/scaling point of view this is not ideal, but to get a feeling for what you are doing it is very useful. Just make sure you keep a record of what you do, and keep intermediate results with clear names.

### 2.1.2 MAFFT local
I'll start by showing you my go-to approach. First, have a look at the manual. 

Next, I'll make a directory to store the untrimmed (hence raw) alignments and run the alignment on all available CPU cores.

I like to wrap this in `if` statements to avoid re-doing things unnecessarily.

L-INS-i (`linsi`) is probably the most accurate mafft setting (as declared by the MAFFT authors). By default, normal or auto mafft does not use it for alignments bigger than 200 sequences. Typically, it only takes a couple of minutes so I don't mind the wait. Building a tree takes a lot longer, so these extra minutes are a sensible investment to me.
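
The guard pattern used in the cells below can be sketched like this. `tr` stands in for the aligner purely so the example runs anywhere, and the file names are made up; the point is only that the second pass through the loop skips the slow step because the output already exists.

```shell
tmp=$(mktemp -d)
printf '>seq1\nmkv\n' > "$tmp/in.fasta"
out="$tmp/alignments_raw/in_aligned.fasta"
runs=0

for attempt in 1 2
do  mkdir -p "$tmp/alignments_raw"     # -p: no error if the directory already exists
    if    [ ! -f "$out" ]
    then  tr 'a-z' 'A-Z' < "$tmp/in.fasta" > "$out"   # stand-in for the slow step
          runs=$((runs + 1))
    fi
done
echo "the slow step ran $runs time(s)"
rm -r "$tmp"
```

One caveat: if the command is interrupted, a partial output file still exists and the guard will skip it next time, so such a file has to be removed by hand first (presumably the purpose of the commented-out `rm` line at the top of the cells below).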

In [22]:
#rm "./data/alignments_raw/$inseq"_aligned-mafft.*
conda activate phylogenetics
if    [ ! -d ./data/alignments_raw/ ]
then  mkdir  ./data/alignments_raw
fi
if    [ ! -f "./data/alignments_raw/$inseq"_aligned-mafft.fasta ]
then  mafft --auto --thread $(nproc) data/$inseq.fasta \
              > ./data/alignments_raw/"$inseq"_aligned-mafft.fasta \
              2> ./data/alignments_raw/"$inseq"_aligned-mafft.log
fi
conda deactivate

(phylogenetics) (phylogenetics) (phylogenetics) 

In [None]:
#rm "./data/alignments_raw/$inseq"_aligned-mafft.*
conda activate phylogenetics
if    [ ! -d ./data/alignments_raw/ ]
then  mkdir  ./data/alignments_raw
fi
if    [ ! -f "./data/alignments_raw/$inseq"_aligned-mafft-einsi.fasta ]
then  einsi --thread $(nproc) data/$inseq.fasta \
              > ./data/alignments_raw/"$inseq"_aligned-mafft-einsi.fasta \
              2> ./data/alignments_raw/"$inseq"_aligned-mafft-einsi.log
fi
conda deactivate

(phylogenetics) (phylogenetics) 

In [30]:
tail ./data/alignments_raw/"$inseq"_aligned-*.log

==> ./data/alignments_raw/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi.log <==
If unsure which option to use, try 'mafft --auto input > output'.
For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been changed in version 7.110 (2013 Oct).
It tends to insert more gaps into gap-rich regions than previous versions.
To disable this change, add the --leavegappyregion option.

Parameters for the E-INS-i option have been changed in version 7.243 (2015 Jun).
To switch to the old parameters, use --oldgenafpair, instead of --genafpair.


==> ./data/alignments_raw/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft.log <==
 FFT-NS-2 (Fast but rough)
 Progressive method (guide trees were built 2 times.)

If unsure which option to use, try 'mafft --auto input > output'.
For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been changed in version

In [31]:
ls ./data/alignments_raw -sh

total 144M
 45M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft-einsi.fasta
3.1M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft.log
 47M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi.fasta
5.0M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi.log
 46M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft.fasta
 56K orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft.log


In [None]:
head ./data/alignments_raw/"$inseq"_aligned-mafft*.fasta

![](data/alignments_raw/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft.png)

# 3. Alignment trimming

Odds are, your alignment is quite gappy, which may confuse tree-building algorithms. Often it is better to remove gappy columns from your alignment. Let's have a look at this with `trimAl`, short for 'trim alignment' (I guess). No artificial intelligence stuff going on here.

As always, have a look at the help page.

You can evaluate the trimming in the webpage that trimal made. Browse in the jupyter file browser to: 

> data/alignments_trimmed/...trim-auto.html

The webpage should open in your browser and you can check how many sequences and columns have been retained, and see exactly which ones. If you are content, proceed to tree building!

## 3.2 Tweak trimming parameters

Alternatively, you may tweak your own trimming parameters like so. 

Every time I change parameters, I change the variable `$trimappendix` to reflect those changes. I also explain briefly in a text cell why I chose to do so.
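
To get a feeling for what a gap threshold does, the sketch below computes the fraction of non-gap characters per column in a tiny made-up alignment and keeps the columns where that fraction is at least 0.4. The trimAl help describes `-gt` as 1 minus the fraction of sequences with a gap allowed, which is what this imitates; it is an illustration in awk on a linearised alignment, not trimAl's actual implementation.

```shell
tmp=$(mktemp -d)
# Made-up alignment: 5 sequences, 5 columns, '-' is a gap
printf '>s1\nMK-A-\n>s2\nMK-AV\n>s3\nM--A-\n>s4\nMR---\n>s5\n-K-A-\n' > "$tmp/aln.fasta"

kept=$(awk -v gt=0.4 '
  /^>/ { next }                                   # skip headers
  { n++; len = length($0)                          # count sequences, remember width
    for (i = 1; i <= len; i++)
      if (substr($0, i, 1) == "-") gaps[i]++ }     # tally gaps per column
  END { out = ""
    for (i = 1; i <= len; i++)                     # keep columns with enough non-gap residues
      if ((n - gaps[i]) / n >= gt) out = out (out == "" ? "" : " ") i
    print out }' "$tmp/aln.fasta")
echo "columns kept at gt=0.4: $kept"
rm -r "$tmp"
```

Column 3 (all gaps) and column 5 (one residue in five sequences) fall below the threshold; the other three columns are kept.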

In [107]:
if    [ ! -d data/alignments_trimmed ]
then  mkdir  data/alignments_trimmed 
fi
conda activate phylogenetics
# define appendix only once here:
trimappendix='trim-gt4'


for a in "data/alignments_raw/$inseq"_aligned*.fasta
do  appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
    if    [ ! -f data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".fasta ]
    then  echo "trimming alignment $a"
          sed -i 's/ /_/g' $a
          trimal -in $a   \
                 -out data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".fasta \
                 -gt .4 2> /dev/null &
    fi
done
wait    # let the background trimal jobs finish before moving on
conda deactivate

(jalview) (phylogenetics) (phylogenetics) (phylogenetics) (phylogenetics) (phylogenetics) trimming alignment data/alignments_raw/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi.fasta
[1] 1455262
trimming alignment data/alignments_raw/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft.fasta
[2] 1455272
(phylogenetics) (jalview) 


In [108]:
ls data/alignments_trimmed -sh

total 582M
1.5M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq90-res90.fasta
 42M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq90-res90.png
1.4M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90.fasta
 40M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90.png
1.6M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4.fasta
1.6M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq90-res90.fasta
 45M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq90-res90.png
1.6M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq92-res90.fasta
 45M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq92-res90.png
1.6M orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq94-res90.fasta
 45M orthogro


In [103]:
conda activate jalview
for   i in data/alignments_trimmed/*.fasta
do    prefix=$(echo $i | sed 's/\.fasta//')
      if    [ ! -f $prefix.png ]
      then  jalview -nodisplay \
                    -open $prefix.fasta \
                    -colour CLUSTAL \
                    -png  $prefix.png > /dev/null 2> /dev/null &
      fi
done
wait
conda deactivate

(jalview) [2] 1314579
(jalview) 
[2]+  Done                    jalview -nodisplay -open $prefix.fasta -colour CLUSTAL -png $prefix.png > /dev/null 2> /dev/null
(jalview) 

In [59]:
ls data/alignments_trimmed/*.png -sh

 42M data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq90-res90.png
 40M data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90.png
 45M data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq90-res90.png
 45M data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq92-res90.png
 45M data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq94-res90.png
 45M data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq95-res92.png
 45M data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq95-res94.png
 44M data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq96-res90.png
 43M data/alignments

In [109]:
grep -c '>' data/alignments_trimmed/*.fasta | sed 's,data/alignments_trimmed/,,'

orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq90-res90.fasta:4301
orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90.fasta:4154
orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4.fasta:4364
orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq90-res90.fasta:4364
orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq92-res90.fasta:4364
orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq94-res90.fasta:4346
orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq95-res92.fasta:4332
orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq95-res94.fasta:4323
orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4-seq96-res90.fasta:4216
orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft_trim-gt4.fasta:4364
orthogroup_AtLDOX_AT4g


Based just on the number of sequences, 767 is too few, but the original number, 4300-ish, is too high. Let's take a look at:
 - trim-gt4-seq95-res90.fasta:4154
 - trim-gt4-seq95-res92.fasta:4087

#[](data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90.png)

#[](data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft_trim-gt4-seq95-res92.png)

Since the PNGs are so big (40 to 45 MB each), I chose not to include them in the notebook here or in git. The fastas are included, as is the code to visualise them, so anyone can regenerate the images.


Looking at the alignments, I still think they have quite some 'horizontal banding', which I don't like. Rather than painstakingly trying to optimise the trimAl settings, I'm going to remove all sequences for which IQtree reports a failed sequence composition test. This is way faster and will take care of my problem as well. Besides the horizontal banding, I'm happy with the gap threshold of .4.

To go ahead, I'll make trees of the second most stringently trimmed set of sequences: __trim-gt4-seq95-res92.fasta:4087__

Now make a list of sequence numbers to be removed, based on the sequence composition assessment in the iqtree log:

In [97]:
grep '  failed  ' ./analyses/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_fasttrees/aligned-mafft_trim-gt4-seq95-res92/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft_trim-gt4-seq95-res92_iqtree-fast.log | grep -o -e '^ *[0-9]*' | sed 's/$/,/g' | tr -d ' ' | tr -d '\n' | sed 's/,$//'

69,70,85,86,129,238,254,272,304,349,380,389,427,539,549,552,564,648,673,684,686,700,706,739,811,839,856,867,889,897,926,939,942,949,991,992,998,1017,1029,1062,1069,1081,1179,1203,1204,1207,1241,1256,1264,1273,1278,1287,1366,1409,1446,1459,1469,1470,1480,1541,1602,1617,1635,1650,1654,1736,1756,1766,1768,1813,1870,1890,1904,1913,1918,1958,1971,2022,2048,2049,2058,2080,2093,2173,2190,2248,2287,2292,2323,2351,2357,2380,2383,2415,2432,2433,2443,2473,2475,2484,2498,2514,2542,2563,2564,2581,2614,2639,2649,2653,2662,2671,2689,2756,2767,2780,2823,2824,2924,2935,2979,2992,3027,3045,3074,3094,3164,3175,3200,3252,3366,3375,3389,3426,3453,3461,3472,3525,3564,3600,3618,3620,3633,3652,3660,3703,3711,3792,3800,3845,3847,3883,3884,3909,3957,3962,3971,3991,3993,4025
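
The list above can also be produced with a single awk call. The sketch below runs on a few made-up lines that imitate the layout of the composition table in the IQ-TREE log. One thing worth double-checking: the trimAl help gives the range for `-selectseqs` as 0 to (number of sequences - 1), whereas the log table numbers sequences from 1, so the indices may need to be shifted by one before they are passed to trimAl. That shift is an assumption to verify, for instance by comparing the names that were removed with the names flagged in the log.

```shell
tmp=$(mktemp -d)
# Made-up lines imitating the per-sequence composition table in an IQ-TREE log
cat > "$tmp/toy.log" <<'EOF'
   1  seqA      0.00%    passed     99.90%
   2  seqB     35.10%    failed      0.01%
   3  seqC      2.40%    passed     87.00%
   4  seqD     51.20%    failed      0.00%
EOF

# 1-based numbers as printed in the log, joined with commas
failed=$(awk '/  failed  /{print $1}' "$tmp/toy.log" | paste -sd, -)
# Same list shifted to 0-based numbering (assumption: needed for trimAl, see above)
failed0=$(awk '/  failed  /{print $1 - 1}' "$tmp/toy.log" | paste -sd, -)
echo "log numbering: $failed  zero-based: $failed0"
rm -r "$tmp"
```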

In [100]:
conda activate phylogenetics
trimal -in data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft_trim-gt4-seq95-res92.fasta \
       -out data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft_trim-gt4-seq95-res92_seqrm-iqtree-content.fasta \
       -selectseqs \{ $(grep '  failed  ' ./analyses/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_fasttrees/aligned-mafft_trim-gt4-seq95-res92/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft_trim-gt4-seq95-res92_iqtree-fast.log | grep -o -e '^ *[0-9]*' | sed 's/$/,/g' | tr -d ' ' | tr -d '\n' | sed 's/,$//') \}
conda deactivate

(phylogenetics) (phylogenetics) 

In [102]:
grep '>' -c data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft_trim-gt4-seq95-res92_seqrm-iqtree-content.fasta

3927


Well, that's slightly better I guess... but not as good as I had hoped. Anyway, I also don't want to take out too much, so I'm leaving this for now: make a tree first, then decide again.

# 4. Fast tree building
Here we'll make fast trees: not accurate, no bootstraps, but fast. This gives us an idea of the output and how we will process it. Building 'proper' trees can take days, sometimes weeks, so it's better to be sure you have all the sequences you want in there before you start. 

I use two ways to make these fast trees: first with a program called `fasttree`, and second with the program `iqtree` with the `-fast` parameter. My gut feeling is that the latter is a bit more accurate but takes a couple of minutes. Fasttree takes seconds.

I arbitrarily consider trees to be analyses and not data, hence I store these in the `analyses` directory.

Since these trees run fast (just take a second to consider how ridiculous that sounds) I propose to run these in loops again, taking all the trimmed alignments that were made earlier. The trees run in parallel, each on one CPU. If you're running many trees (way more than you have computing cores) then don't run these in the background. Practically, that means removing the `&` character almost at the end of the loop.
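
If you do have more alignments than cores, one alternative to dropping the `&` is to cap the number of simultaneous jobs. A minimal sketch with `xargs -P` (available in GNU and BSD xargs, but not part of POSIX) is shown below; `touch` stands in for the tree builder and the file names are made up.

```shell
tmp=$(mktemp -d)
touch "$tmp/a.fasta" "$tmp/b.fasta" "$tmp/c.fasta" "$tmp/d.fasta"

# Run at most 2 jobs at a time; each job writes <alignment>.done when finished.
# xargs returns only after all jobs have completed, so no separate `wait` is needed.
ls "$tmp"/*.fasta | xargs -P 2 -I{} sh -c 'touch "$1.done"' _ {}

ndone=$(ls "$tmp"/*.done | wc -l | tr -d ' ')
echo "$ndone jobs finished"
rm -r "$tmp"
```

Setting `-P` to the number of cores keeps the machine busy without oversubscribing it.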

## 4.2 IQtree -fast

And here is the same but for running iqtree. I picked some random model here, but substitute it with anything you like better or have had good experience with in the past.

Based on a previous model fit to a very similar alignment, these two models should fit best according to different criteria:
 - LG+G4
 - JTT+R10
 
 Given the size of the alignment, I'm inclined to go with the former.

In [111]:
#for a in data/alignments_trimmed/"$inseq"_aligned*.fasta
conda activate phylogenetics
for a in data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4.fasta
do  echo "making a iqtree fast tree of file $a"
    appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
    echo $appendix
    if   [ ! -d   analyses/"$inseq"_fasttrees/"$appendix" ]
    then mkdir -p analyses/"$inseq"_fasttrees/"$appendix"
    fi
    
    iqprefix=analyses/"$inseq"_fasttrees/"$appendix"/"$inseq"_"$appendix"_iqtree-fast
    if   [ ! -f "$iqprefix".iqtree ]
    then nice iqtree -s $a -fast \
                     -m 'JTT+R10' \
                     -pre "$iqprefix" \
                     > "$iqprefix".stdout \
                     2> "$iqprefix".stderr &
    fi
done
wait
conda deactivate

(jalview) (phylogenetics) making a iqtree fast tree of file data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4.fasta
aligned-mafft-einsi_trim-gt4
[1] 1455988
(phylogenetics) [1]+  Done                    nice iqtree -s $a -fast -m 'JTT+R10' -pre "$iqprefix" > "$iqprefix".stdout 2> "$iqprefix".stderr
(phylogenetics) (jalview) 


### intermediate conclusion

Apparently, I have thrown out some Azolla sequences of interest with trimal. Let's just ditch the advanced trimal settings; I'll go with simple sequence gap content instead.

In [None]:
cat analyses/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_fasttrees/aligned-mafft-einsi_trim-gt4/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4_iqtree-fast.log | grep -E ' *[0-9]+.+ *[0-9]+\.[0-9]+\% +[passedfil]+ + [0-9]+\.[0-9]+\%' | grep -Eo '[0-9]+\.[0-9]+\% +[passedfil]+ + [0-9]+\.[0-9]+\%'  | sed -E 's/ +/ /g' | cut -f 1 -d ' '| tr -d '%' | sort -n | cut -f 1 -d '.' | uniq --count
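
The pipeline above tallies sequences by their rounded-down gap percentage. A shorter awk sketch of the same idea, run on made-up lines that imitate the IQ-TREE table, bins by tens of percent instead:

```shell
tmp=$(mktemp -d)
# Made-up lines: number, name, gap/ambiguity %, composition test, p-value
cat > "$tmp/toy.log" <<'EOF'
   1  seqA      0.00%    passed     99.90%
   2  seqB     35.10%    failed      0.01%
   3  seqC      2.40%    passed     87.00%
   4  seqD     51.20%    failed      0.00%
   5  seqE     38.70%    passed     12.00%
EOF

# Column 3 is the gap percentage; awk's numeric conversion ignores the trailing '%'
bins=$(awk '$4 == "passed" || $4 == "failed" { print int($3 / 10) * 10 }' "$tmp/toy.log" \
  | sort -n | uniq -c \
  | awk '{ printf "%d-%d%%: %d\n", $2, $2 + 9, $1 }')
echo "$bins"
rm -r "$tmp"
```

A table like this makes it easier to pick a gap-content cut-off before deciding which sequences to drop.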


In [None]:
cat analyses/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_fasttrees/aligned-mafft-einsi_trim-gt4/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v2_aligned-mafft-einsi_trim-gt4_iqtree-fast.log \
    | grep -oE ' *[0-9]+.+ *[0-9]+\.[0-9]+\% +[passedfil]+' \
    | sed -E 's/ +/ /g' \
    | tail -n +112 \
    | grep -E '[4-9][0-9]\.[0-9]+\%' \
    >

## 4.3 Visualise your fast trees. 

To visualise your trees, you perhaps already have something installed like mega, seaview, etc. Otherwise you can upload the tree file to [iToL](https://itol.embl.de/) (my preferred method) or any other website that visualises trees. See section 6 for uploading your trees to iToL.

Alternatively, we can try to get a quick snapshot here in the notebook:

# 5. Building trees with IQtree 

Finally, we're at the stage to build proper maximum likelihood phylogenetic trees! Based on your previous results, you should have one or two trimmed alignments you want to make a tree of. There are still several choices to make: a model of evolution and a bootstrapping method.

**modelfinder**

IQtree is a state-of-the-art tree-building program, which has a model finder algorithm included! This can take a couple of hours, so be sure to do this only once. There are two model finder options: a quick one with some often-used models, `-m TEST`, or an extended model finder using more models of evolution and substitution, `-m MFP`. I recommend the latter. Once you have your best-fit model (for example 'LG+R7'), use this model when you build more trees from the same alignment: `-m 'LG+R7'`

**bootstrapping**

Normal or 'non-parametric' bootstrapping can take quite a long time; I have had trees running for weeks. Hence there are alternatives that are a lot faster, but these might over- or underestimate the bootstrap values if your alignment doesn't fit your model well. For 'normal bootstraps' the minimum is 100 replicates. I like to do 200 to be safe, by adding the option `-b 200`.

Alternatively, there is the 'ultrafast bootstrap' option in IQtree. The minimum for this is 1000 bootstraps, so I like to do double by including the parameter `-bb 2000`. Additionally, I highly recommend also running the SH-like approximate likelihood ratio test (SH-aLRT) with 2000 replicates at the same time by including the parameter `-alrt 2000`. This adds a minimal amount of run time and makes interpretation of your tree a lot more reliable.

As the [IQtree FAQ](http://www.iqtree.org/doc/Frequently-Asked-Questions#how-do-i-interpret-ultrafast-bootstrap-ufboot-support-values) says: typically you start believing a clade when the ultrafast bootstrap >= 95 and the SH-aLRT >= 80. These values do not scale like 'normal' bootstrap values: if you lower the ultrafast bootstrap threshold to 90, you will likely overestimate the support for your clades considerably.
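To get a first impression of how many clades pass both thresholds, you can count them directly in the tree file. This is a minimal sketch, assuming the tree was built with both `-alrt` and `-bb`, so that internal nodes are labelled as `SH-aLRT/UFBoot` (for example `)85.3/97:0.0123`). The function name `supported_nodes` is hypothetical.

```shell
supported_nodes() {
    # usage: supported_nodes TREEFILE [MIN_ALRT] [MIN_UFBOOT]
    # Count internal nodes whose "SH-aLRT/UFBoot" label passes both thresholds.
    local tree="$1"
    local min_alrt="${2:-80}"
    local min_ufb="${3:-95}"
    grep -oE '\)[0-9.]+/[0-9.]+' "$tree" \
        | tr -d ')' \
        | awk -F '/' -v a="$min_alrt" -v u="$min_ufb" '
            { total++; if ($1 + 0 >= a && $2 + 0 >= u) ok++ }
            END { printf "%d of %d labelled nodes pass\n", ok, total }'
}

# Example, with the defaults SH-aLRT >= 80 and UFBoot >= 95:
# supported_nodes "$iqprefix".treefile
```

This only gives a count. To see which clades are supported, you still need to look at the tree itself.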

**other command-line options**

In the command I wrote below, I instruct iqtree to use no more CPU cores than your computer has (`-ntmax $(nproc)`), but also to find the optimum number of cores (`-nt AUTO`; more is not always better). Second, a prefix is defined (`-pre`) to store the different files that IQtree will make.

**More info**
* iqtree tutorial: http://www.iqtree.org/doc/Tutorial
* aLRT: https://www.ncbi.nlm.nih.gov/pubmed/16785212





## running IQtree

Now these are all trimmed alignments you have available. 
Choose one to start with (based on your fasttrees or inspections of your alignments).

Make sure that 
1. the variable `$a` holds the path to this alignment
2. you choose an appendix (`$iqpendix`) that describes your iqtree settings
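The naming scheme used in this notebook can be sketched as a small function, which makes it easy to check beforehand where a tree will end up. This is an illustration under assumptions, not part of the original workflow: the function name `tree_prefix` is hypothetical, and it assumes alignments end in `.fasta` and start with `"$inseq"_`.

```shell
tree_prefix() {
    # usage: tree_prefix ALIGNMENT INSEQ IQPENDIX
    # Print the output prefix for a tree built from ALIGNMENT.
    local aln="$1"
    local inseq="$2"
    local iqpendix="$3"
    local name
    name=$(basename "$aln" .fasta)   # drop the directories and the .fasta extension
    name=${name#"${inseq}_"}         # drop the leading "<inseq>_" if present
    printf 'analyses/%s_trees/%s/%s_%s_%s\n' \
        "$inseq" "$name" "$inseq" "$name" "$iqpendix"
}

# Example:
# tree_prefix "$a" "$inseq" iqtree-bb2000-alrt2000
```

Because the prefix contains both the alignment name and the iqtree settings, every tree file name is a brief summary of how that tree was made.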

In [None]:
ls data/alignments_trimmed/"$inseq"_aligned*fasta

In [None]:
a=data/alignments_trimmed/orthogroup_AtLDOX_AT4g22880_selection-v1_guide-v1_aligned-mafft_trim-gt4-seq95-res90.fasta

# appendix describing the iqtree settings; keep it in sync with the command below
#iqpendix='iqtree-b200'
iqpendix='iqtree-bb2000-alrt2000'

echo "making a tree of file $a"
echo "The first lines of alignment $a look like this"
head "$a"

# strip the leading directories, the "$inseq"_ prefix and the .fasta extension
file_appendix=$(echo "$a" | cut -d '/' -f 3- | sed "s/${inseq}_//" | sed 's/\.fasta$//')
echo "Making a directory $file_appendix to store trees (name based on alignment filename)"

# mkdir -p does nothing if the directory already exists
mkdir -p analyses/"$inseq"_trees/"$file_appendix"

iqprefix=analyses/"$inseq"_trees/"$file_appendix"/"$inseq"_"$file_appendix"_"$iqpendix"
# only run when the final tree is not there yet; IQtree writes it to <prefix>.treefile
if   [ ! -f "$iqprefix".treefile ]
then nice iqtree -s "$a" \
                 -m 'JTT+R10' \
                 -bb 2000 \
                 -alrt 2000 \
                 -nt AUTO \
                 -ntmax $(nproc)  \
                 -pre  "$iqprefix" \
                 2>   "$iqprefix".stderr \
                 >    "$iqprefix".stdout \
                 && cat "$iqprefix".stdout | mail -s ANS_IQtree_run laura.w.dijkhuizen@gmail.com
fi

In [None]:
ls -1 "$iqprefix"*

You can have a look at the last lines of your log file like this:

In [None]:
tail -n 40 $iqprefix.log

Are you content with your tree? Great news! If you want to do another run, I recommend copying the cell above and editing the copy. That way you keep the code for all trees you made. Don't forget to explain what you observed, why you're making a new tree, and what you're changing (remember, this is your lab journal).

# tree storage

For tree storage and sharing, I have yet to encounter a better tool than EMBL's [iToL](https://itol.embl.de/). It's a great interface for exploring and sharing trees with colleagues. You can browse to the treefile IQtree created on your computer and upload it to iToL. Alternatively, you can copy and paste the contents of the file into iToL. Make sure to keep the original filename as well! This filename contains a brief summary of how the tree was made.

On request, extracting all sorghum sequences

In [5]:
grep Sorghum_bicolor -A 1 --no-group-separator ./data/orthogroup_AtLDOX_AT4g22880_selection-v1.fasta > data/Sorghum-subset-LDOX-orthogroup.f

>Sorbi_v2_1-Sobic_002G003100_1-Sorghum_bicolor
MEDYDYEPPLMATYRHLLDSHPHRLDVVDHRSGADDDEEGFLLPVIDLSSLLEQSSSGAEAAAEQCRASIVRAASEWGFFQVTNHGVPQALLDELHQAQVAVFRRPFHLKASQPLLDFSPESYRWGTPTATCLDQLSWSEAYHIPTTNTTAAADDKTRLVVEEVSTAMSKLAQRLAGILVADLLLGDSSIGDGEDDDTAAAVVSRCTRSTCFLRLNRYPPCPAPSGAYGLCPHTDSDFLTILHQDGVGGLQLVKAGRWVAVKPNPGALIVNVGDLLQAWSNDRYRSVEHRVMASDARERFSVAFFLCPSYDTLVRPRCGAGGPPRYESFTFGEYRNQIREDVRLTGRKLGLQRFRKPE
>Sorbi_v2_1-Sobic_008G109600_1-Sorghum_bicolor
MEINLLHVTGPSHASLPVPDSYAVPQLPQAKATPTDITLPLIDLSRSRGELRRAILDAGKEFGFFQVINHGVPEQVLQDVEAVSEEFFQLPAADKAHFYSDDTNRPNRLFSGSTYKTSKRLYWMDCLRLARAFPGGDSKKEWPEKPEELRNVYENYTALMRGLGLELLHMLCEGLGLPSDYFDGDQSAGDMILSVIRYPPCPTPDVTLGLPPHCDRNLITLVLSGSVPGLQVLYKGDWIMVKPIRNSFVINFGLHLEVVTNGIIKSVEHRVITNSVQARTSLVITLNGTEDCLIGPAGELLGENKPPRYRTVTLRDFMRIYNKSLENPDAAIKERMKPFMI
>Sorbi_v2_1-Sobic_010G060200_1-Sorghum_bicolor
MADESWRIPMLVQELAAKVQQPPSRYVQPEQYHPVSLDVGAETPEPIPVIDLSRLSAAADAAGESGKLRLALQSWGLFLVANHGIETDLMDDLIDASREFFHLPLEEKQKCSNLIDGKYFQVEGYGNDPVRSKDQNLDWLDRLHLRVEPEDERNLVHW

>Sorbi_v2_1-Sobic_009G064700_1-Sorghum_bicolor
MMPSSSSSASTPAAASGGLFELGSAASVPETHAWPGVNEHPSVESAGRDAVPVVDMGMGGPDDADAAARAVARAAEEWGAFLLVGHGVPRGVAARAEAQVARLFALPAPDKARAARRRRAAAAAAAGYGMPPLALRFSKLMWSEAYTFPAAAVRDEFRRVWPDAGDDYLRFWYVRTPVWYVTRATLPHVDLHLHVHACMRAPSDVMEEYDREMRALGGRLLDLFFMALGGGLTDDDQIAGGETTTTERKIRDNLTAMMHPILYPKCPEPERAMGLAPHTDSGFITLITQSAGVPGLQLLRRGPDRWVTVPAPPGAFVVVLGDLFQVLTNGRYRSALHRAVVNRERDRISVPYFLGPPDGMKVAPLASALLPGRRKAAFRAVTWPEYMELKHKVLGTDTSALEMLQLDEEEM
>Sorbi_v2_1-Sobic_001G407800_1-Sorghum_bicolor
MTSCFNGGAGWPEPVVRVQHVSDTCGDTIPERYVKPPSERPCLSPAAASSGGVGGGGGGGPNIPVVDLSMLDVDATSRAVAAACREWGFFQAVNHGVRPELLRSGRAAWRGFFRQPAVVRERYANSPATYEGYGSRLGTAKGGPLDWGDYYFLHLLPASLKSHEKWPSLPSSLRGTTEEYGEEVLQLCRRVMRLLSSGLGLEAGRLQAAFGGEGGEGACLRVNLYPRCPQPELTLGVAGHSDPGGMTMLLVDDHVKGLQVRSPDGQWIIVDPVPDAFIVNVGDQIQVLSNASYKSVEHRVTVSAAEDRLSMAFFYNPRSDLPIAPMPELVGPGRPALYPEMTFDEYRVFIRQRGLAGKAQLQSLQANQTAAAAAAAGSSSTCS
>Sorbi_v2_1-Sobic_007G167400_1-Sorghum_bicolor
MAIVDLANAQLQQAGAGAAAATMREDDDGHDHEQESSYDYGACLMKGVRHLSDSGITRLPDRY

>Sorbi_v2_1-Sobic_001G366100_1-Sorghum_bicolor
MADAARGMGSPSLPVANVQALAETCNTGVDEPVPWRYLSKDPTAEEVVAADDSACAIPVIDFRKLLDPESSSSECARLGSACHHWGFFQLINHGVPDEVIANLKKDVVGFFKQPLEAKKECAQQADSLEGYGQAFVVSEDQKLDWADMLYLIVQPRESRDMRFWPTRPASFRDSVDSYSMEASKLAYQLLEFMAKGVGAADDDDDPAASLRLQGVFQGQVRGMRVNYYPPCRQAADRVLGLSPHTDPNGLTLLLQMNDHDVQGLQVSKDGRWFPVQALDGAFVVNVGDALEIVSNGAFKSVEHRAVIHPTKERISAALFHFPDQDRMLGPLPELVKKGDRVRYGTRSYQDFLKQYFTAKLDGRKLIESFKLE
>Sorbi_v2_1-Sobic_009G230800_1-Sorghum_bicolor
MVAITAPSSIEQIPLVQCPRANASAAIPCVDLSAPGAAAAVADACRGVGFFRATNHGVPARVVEALEARAMAFFALPAQEKLDMSGAARPMGYGSKRIGSNGDVGWLEYLLLSVSANTVKISSLPPSLRAALEEYTAAVREVCGRVLELIAEGLGVDRSLLRAMVVGREGSDELVRVNHYPPCPLLPPVDCGVTGFGEHTDPQIISVLRSNSTAGLQIKLRDGRWVPVPPAPESFFVNVGDALQVLTNGRFKSVKHRVVAPEGAQSRLSVIYFGGPAPSQRIAPLPEVMRDGEQSLYREFTWAEYKTAMYKTRLADHRLGPFELRATNTNSCVPPPPPPSVDPYCNGSGICMPQPPPQQQQVAEVH
>Sorbi_v2_1-Sobic_006G190100_1-Sorghum_bicolor
MAPAISKPLLSDLVAQIGKVPSSHIRPVGDRPDLANVDNESGAGIPLIDLKMLNGPERRKVVEAIGRACESDGFFMVTNHGIPAAVVEGMLRVAREFFHLPESERLKCYSDDPKKAIRL

>Sorbi_v2_1-Sobic_004G348400_1-Sorghum_bicolor
MAEPLSNGAVYHSVPESYVLPEHKRPGSSPPSCSAAAIPVVDLGGDDTDRMAEQIVAAGREFGFFQVINHGVPEDVMRAMMSAAEEFFKLPTEEKMAHYSTDSTKLPRFHTSVGKEQEQLLYWRDCLKIGCYPFEEFRRQWPDKPAGLGAALEPYTAAVRGVALRVLRLAASGLGLADEAHFEAGEVTAGPVIMNVNHYVACPEPSLTLGIAPHCDPNVVTVLMDNGVRGLQARRRHGHQGNGEGGGGWVDVDPPPGALIVNFGHQMEVVTNGRVRAGEHRAVTNARAPRTSVAAFVMPAMGCVVSPAPEMVAEGEAPLLRPYTYQEFVGVYTAANGDRDAVLARLQNNNG
>Sorbi_v2_1-Sobic_010G194100_1-Sorghum_bicolor
MADDQPAWKIPPIVQELTAGVQEPPSRYVVGEQDRPAMAAAAAMPEPIPIVDLSRLSANDGADDDETAKLLSALQNWGLFLAVGHGMDPGFLTEMMEVTRGFFNLPLDEKQKYSNLANGKEFRFEGYGNDMVLSEDQVLDWCDRLYLTVEPESRIVRSLWPAQPPAFSDVLREYTTRCREIAGVVLASLARLLGLHEGRFVGMMSDGVAMTHARFNYYPRCPEPDRVLGLKPHSDASVITVVLIDDAVGGLQVQKPNDDDGVWYDVPIVPNALLVNVGDVTEIMSNGLFRSPVHRAVTNAESDRVSLAMFYTLDSEKEIEPLPELVDDKRPRRYRKTTTKDYLALLFERFTRGERALDAVKIDLNDD
>Sorbi_v2_1-Sobic_001G340400_1-Sorghum_bicolor
MADAAATAGKLFGREKITDTTVTLFAESANKIPDERFIRTKEVQAAGAVVGEDDEMPLELPVVDMASLVDPDSSASETAKLGSACREWGFFQLTNHGVEEAAMQQMKDSAAEFFRSPLESKNTVAVRDGFQGFGHHFNG

>Sorbi_v2_1-Sobic_010G239900_1-Sorghum_bicolor
MGSDFKSIPLIDISPLVEKIDDPSMANDRDLLQVVRLLDDACKEAGFFYVKGHGIDESLMREVRNVTRKFFQLPYEEKLKIKMTPQSGYRGYQRLGENITKGKPDMHEAIDCYTPIRPGKYGDLAKPMEGSNLWPEYPSNFEVLLENYINLCRDISRKIMRGIALALGGAIDAFEGETAGDPFWVLRLIGYPVDIPKEQRTDTGCGAHTDYGLLTLVNQDDDICALEVQNRSGEWIYATPIPGTFVCNIGDMLKVWTNGIYQPTLHRVVNNSPRYRVSVAFFYESNFDAAIEPVEFCREKTGGAAKYEKVVYGEHLVQKVLTNFVM
>Sorbi_v2_1-Sobic_003G143100_1-Sorghum_bicolor
MCVCERDYSNLYRTETSLQQEEEEPTLLSMAHAKSAGGNLQVPNVQALSQTWNQSGELVPARYVRTEETSDAVVVAGCALPVVDLGRLLDPRSSQEELAVLGSACQQGFFQLVNHGVPDDVVLDVRRDIAEFFRLPLEAKKVYAQLPDGLEGYGQAFVFSEAQKLDWSDMMYLMLRPVESRDMSFWPVHPPSFRTSVDRYSAEAAKVVWCLLRFMAADMGVEPELLQEMFAGQPQTMKMTYYPPCRQADKVIGLSPHTDACAVTLLLHVNDVQGLQIRMDDGKWHPVEPLDGALIVSVGDIIEILSNGKYRSIEHRAVVHPDKERISAAMFHQPRGSITVEPLPELVKKDGGVARYKSVGYAEFMKRFFSAKLDGRKGHLDHFRV
>Sorbi_v2_1-Sobic_008G109300_1-Sorghum_bicolor
MENLLHVTPSHVSLPNSYAVPQLPQAKATPTDISLPVIDLSRSRDEVCRAILDAGKEFGFFQVINHGIPEQVLQDMESVSEEFFQLPAADKAHFYSEDTNRPNRLFSGSTYKTSKRLYWMDCLRLARTFPGSDCKKEWPEKPEELR
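The `grep -A 1` extraction above relies on every sequence being written on a single line. If your fasta file wraps sequences over several lines, a sketch like the following keeps whole records instead. The function name `fasta_subset` is hypothetical, and the pattern is matched as a regular expression against the header lines only.

```shell
fasta_subset() {
    # usage: fasta_subset PATTERN FASTA
    # Print every record whose header line matches PATTERN.
    awk -v pat="$1" '
        /^>/ { keep = ($0 ~ pat) }   # header line: decide whether to keep this record
        keep                         # print header and sequence lines of kept records
    ' "$2"
}

# Example:
# fasta_subset Sorghum_bicolor data/orthogroup_AtLDOX_AT4g22880_selection-v1.fasta | grep -c '^>'
```

Counting the `>` lines of the result is a quick check that you extracted the number of sequences you expected.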

tree sel1 v1: https://itol.embl.de/tree/9421021579222211612506891
![image-2.png](attachment:image-2.png)


selection v1 guide v2: https://itol.embl.de/tree/1312115897247661613463548 with some sequences removed by trimAl
![image.png](attachment:image.png)


selection v1 guide v2 without trimal seq removal: https://itol.embl.de/tree/9421021579312521613499044
![image-2.png](attachment:image-2.png)
