## **Spoiler: no partitioning because of problems with the data**

❌ Steps to achieve our goal "Article reproduction":

1. Download all FASTA files for the COI and 18S genes
2. Align the sequences separately for the 2 genes (MAFFT)
3. Check the alignments (UGENE)
4. Trim the alignments with trimAl if necessary
5. Use IQ-TREE to select the best-fit model for each gene
6. Concatenate the 2 genes with FASconCAT-G
7. Follow the explanation on the [IQTree website](http://www.iqtree.org/doc/Complex-Models) to build the tree with partitioning
8. Visualize the tree

Additional part:

*   Use only the species with rearrangements to get a close-up of them (repeat the previous steps, but only with these species)
*   Add the metadata to the tree visualization


> I guess it would be better to use different Colab notebooks for different parts of the work ("Downloading data", "General tree reconstruction" and "Rearrangements tree")

---

> We wanted to use partitioning for this task, but the data do not allow it:

*   18S: requires downloading whole genomes (files too big), 33/215
*   28S: requires downloading whole genomes (files too big), 88/215
*   COI: the only gene available for all species
*   ITS1: 16/215
*   Cytb: 62/215 mitochondrion, complete genome
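To put the counts above in perspective, here is a minimal sketch that turns them into coverage fractions. It assumes that each `n/215` is the number of species (out of 215) with a usable record for that marker, which is an interpretation of the list and not something stated explicitly. The `min_coverage` threshold is purely hypothetical.

```python
# Counts copied from the list above; reading them as "species with a
# usable record out of 215" is an assumption.
TOTAL_SPECIES = 215
marker_counts = {"18S": 33, "28S": 88, "ITS1": 16, "Cytb": 62}


def coverage(count, total=TOTAL_SPECIES):
    """Fraction of species that have a record for a marker."""
    return count / total


def usable_markers(counts, min_coverage):
    """Markers whose coverage reaches a chosen (hypothetical) threshold."""
    return [name for name, n in counts.items() if coverage(n) >= min_coverage]


for name, n in marker_counts.items():
    print(f"{name}: {n}/{TOTAL_SPECIES} = {coverage(n):.1%}")

# With a hypothetical requirement of 50% coverage, no second marker qualifies.
print(usable_markers(marker_counts, 0.5))
```

Under this reading, none of the additional markers covers even half of the species, which is consistent with the decision to continue with COI only.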

---

**No partitioning then** 😞

## **General tree reconstruction**

**Steps to achieve our goal "Article reproduction":**

1. Download all FASTA files for the COI gene ("Downloading_data" notebook)
2. Align the sequences (MAFFT)
3. Check the alignments (UGENE)
4. Trim the alignments with trimAl if necessary (and it is necessary)
5. Check the outgroup IDs
6. Use IQ-TREE to build the tree (with the outgroup!)
7. Rename the tree ("Downloading_data" notebook)
8. Visualize the tree

Additional part:

*   Get the metadata
*   Add the metadata to the tree visualization
*   Explain rearrangements

## **Step 1. Alignment + trimming**

We are going to use MAFFT for sequence alignment and then trimAl to trim it.

> For some species we have complete mitochondrial genomes instead of just the COI gene, so the trimming step is necessary.
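A quick way to see which records are whole mitochondrial genomes is to look at sequence lengths before aligning. The sketch below is only an illustration: `read_fasta`, `long_records`, the 2000 bp cut-off and the toy records are all hypothetical and not part of the original workflow.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict {record_id: sequence}."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}


def long_records(records, max_len=2000):
    """Return {record_id: length} for records longer than max_len (assumed cut-off)."""
    return {rid: len(seq) for rid, seq in records.items() if len(seq) > max_len}


# Toy input with made-up IDs and sequences, standing in for COI_sequences.fasta.
example = [
    ">seqA hypothetical COI fragment",
    "ACGT" * 160,
    ">seqB hypothetical complete mitogenome",
    "ACGT" * 4000,
]
print(long_records(read_fasta(example)))
```

On the real data the same functions could be fed `open("COI_sequences.fasta")`, since a file object iterates over its lines.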

In [1]:
%%capture
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!bash Miniconda3-latest-Linux-x86_64.sh -bfp /usr/local

**1.1 Sequence alignment**

In [2]:
%%capture
!apt-get install mafft

In [4]:
!mafft --auto COI_sequences.fasta > COI_aligned.fasta

nthread = 0
nthreadpair = 0
nthreadtb = 0
ppenalty_ex = 0
stacksize: 8192 kb
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00



Making a distance matrix ..

There are 10 ambiguous characters.
    1 / 215  101 / 215  201 / 215
done.

Constructing a UPGMA tree (efffree=0) ... 
    0 / 215   10 / 215   20 / 215   30 / 215   40 / 215   50 / 215   60 / 215   70 / 215   80 / 215   90 / 215  100 / 215  110 / 215  120 / 215  130 / 215  140 / 215  150 / 215  160 / 215  170 / 215  180 / 215  190 / 215  200 / 215  210 / 215
done.

Progressive alignment 1/2... 
STEP    86 / 214  f
Reallocating..done. *alloclen = 32988
STEP   214 / 214  f
done.

Making a distance matrix from msa.. 
  200 / 215
done.

Constructing a UPGMA tree (efffree=1) ... 
  210 / 215
done.

Progressive alignment 2/2... 
STEP   189 / 214  f
Reallocating..done. *alloclen = 32863
STEP   214 / 214  f
done.

disttbfast (nuc) Version 7.490
alg=A, model=

**1.2 Alignment trimming**

In [5]:
!conda create -n trimal

Channels:
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: - done

## Package Plan ##

  environment location: /usr/local/envs/trimal



Proceed ([y]/n)? y

Preparing transaction: | done
Verifying transaction: - done
Executing transaction: | done
#
# To activate this environment, use
#
#     $ conda activate trimal
#
# To deactivate an active environment, use
#
#     $ conda deactivate



In [6]:
!source activate trimal && conda install trimal -c conda-forge -c bioconda

Channels:
 - conda-forge
 - bioconda
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: / - done

## Package Plan ##

  environment location: /usr/local/envs/trimal

  added / updated specs:
    - trimal


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _libgcc_mutex-0.1          |      conda_forge           3 KB  conda-forge
    _openmp_mutex-4.5          |            2_gnu          23 KB  conda-forge
    libgcc-14.2.0              |       h77fa898_1         829 KB  conda-forge
    libgcc-ng-14.2.0           |       h69a702a_1          53 KB  conda-forge
    libgomp-14.2.0             |       h77fa898_1         450 KB  conda-forge
    libstdcxx-14.2.0           |       hc0a3c3a_1         3.7 MB  con

In [7]:
# COI trimming with -gt 0.1 param
!source activate trimal && trimal -in COI_aligned.fasta -out COI_trimmed.fasta -gt 0.1
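In trimAl, `-gt` sets the minimum fraction of sequences that must have no gap in a column for that column to be kept, so `-gt 0.1` removes columns where more than 90% of the sequences have a gap. The sketch below illustrates that rule in plain Python on a made-up toy alignment. It is a simplification, not trimAl's actual implementation, and `trim_columns` and `toy` are hypothetical names.

```python
def trim_columns(alignment, gap_threshold=0.1, gap_chars="-"):
    """Keep columns whose fraction of non-gap characters is >= gap_threshold.

    `alignment` maps sequence IDs to aligned sequences of equal length.
    """
    if not alignment:
        return {}
    seqs = list(alignment.values())
    n_seqs = len(seqs)
    length = len(seqs[0])
    keep = [
        col
        for col in range(length)
        if sum(seq[col] not in gap_chars for seq in seqs) / n_seqs >= gap_threshold
    ]
    return {rid: "".join(seq[col] for col in keep) for rid, seq in alignment.items()}


# Toy alignment (made up): column 2 is all gaps, column 4 has one residue.
toy = {"a": "AC-G-", "b": "A--G-", "c": "AT-G-", "d": "A--GT"}

print(trim_columns(toy, gap_threshold=0.1))  # only the all-gap column is removed
print(trim_columns(toy, gap_threshold=0.5))  # the sparse last column goes as well
```

A low threshold such as 0.1 is therefore a gentle filter: it mostly removes the long, nearly empty regions that come from aligning a few whole mitochondrial genomes against short COI fragments.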

In [None]:
!mafft --add /content/modified_mt.fa --reorder /content/COI_trimmed.fasta > /content/add_trimmed.fasta

> We lost one sequence after trimming: "*DQ078424.1 Melinaea marsaeus*". This is interesting because it is not inverted or anything like that; it seems the record simply lacks the COI part.
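A loss like this can be detected by comparing the record IDs before and after trimming. The following is a minimal sketch with hypothetical helper names (`fasta_ids`, `lost_ids`) and made-up toy sequences. Only the accession `DQ078424.1` comes from the note above.

```python
def fasta_ids(lines):
    """Return record IDs (first token of each header line) in file order."""
    ids = []
    for line in lines:
        line = line.strip()
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def lost_ids(before_lines, after_lines):
    """IDs present in the first FASTA but missing from the second."""
    after = set(fasta_ids(after_lines))
    return [rid for rid in fasta_ids(before_lines) if rid not in after]


# Toy data: the sequences and the second ID are made up for illustration.
before = [">DQ078424.1 Melinaea marsaeus", "----", ">XX000001.1 hypothetical", "ACGT"]
after = [">XX000001.1 hypothetical", "ACG"]
print(lost_ids(before, after))
```

For the real files, the two arguments could be `open("COI_aligned.fasta")` and `open("COI_trimmed.fasta")`.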

## **Step 2. Tree reconstruction**

We are going to use IQ-TREE for this, with the following outgroup (Trichoptera):

* HM902392.1
* OP817850.1
* MZ629096.1
* MZ628664.1
* MZ627614.1
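Before running the tree search it is worth confirming that every outgroup ID is actually present in the alignment, since the names must match the FASTA headers. The sketch below is one possible check. The function names are hypothetical, and the headers in the demo call are a made-up stand-in for `COI_trimmed.fasta`.

```python
# Outgroup accessions copied from the list above.
OUTGROUP = ["HM902392.1", "OP817850.1", "MZ629096.1", "MZ628664.1", "MZ627614.1"]


def missing_outgroups(fasta_lines, outgroup=OUTGROUP):
    """Return outgroup IDs that do not appear as a header ID in the FASTA lines."""
    present = {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }
    return [rid for rid in outgroup if rid not in present]


def outgroup_argument(outgroup=OUTGROUP):
    """Comma-separated string in the form passed after `-o`."""
    return ",".join(outgroup)


# Demo with made-up headers: only two of the five outgroup IDs are present.
demo_headers = [">HM902392.1 some description", ">OP817850.1"]
print(missing_outgroups(demo_headers))
print(outgroup_argument())
```

An empty list from `missing_outgroups` means all five IDs were found.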

In [8]:
!conda create -n iqtree

Channels:
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: - done

## Package Plan ##

  environment location: /usr/local/envs/iqtree



Proceed ([y]/n)? y

Preparing transaction: | done
Verifying transaction: - done
Executing transaction: | done
#
# To activate this environment, use
#
#     $ conda activate iqtree
#
# To deactivate an active environment, use
#
#     $ conda deactivate



In [9]:
!source activate iqtree && conda install iqtree -c conda-forge -c bioconda

Channels:
 - conda-forge
 - bioconda
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: / - done

## Package Plan ##

  environment location: /usr/local/envs/iqtree

  added / updated specs:
    - iqtree


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    iqtree-2.3.6               |       hdbdd923_0         4.0 MB  bioconda
    ------------------------------------------------------------
                                           Total:         4.0 MB

The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge 
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu 
  iqtree             bioconda/linux-64::iqtree-2.3.6-hdbdd923_0 
  libgcc             conda-forge/linux-64::libgcc-14.2.0-h77fa898_1 
  libgcc-ng          conda

In [10]:
!source activate iqtree && iqtree -s COI_trimmed.fasta -m TEST -bb 1000 -nt AUTO -o HM902392.1,OP817850.1,MZ629096.1,MZ628664.1,MZ627614.1

IQ-TREE multicore version 2.3.6 for Linux x86 64-bit built Aug  4 2024
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor, Heiko Schmidt,
Dominik Schrempf, Michael Woodhams, Ly Trong Nhan, Thomas Wong

Host:    cbb8eb040054 (AVX2, FMA3, 12 GB RAM)
Command: iqtree -s COI_trimmed.fasta -m TEST -bb 1000 -nt AUTO -o HM902392.1,OP817850.1,MZ629096.1,MZ628664.1,MZ627614.1
Seed:    724479 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Sun Oct 27 13:49:54 2024
Kernel:  AVX+FMA - auto-detect threads (2 CPU cores detected)

Reading alignment file COI_trimmed.fasta ... Fasta format detected
Reading fasta file: done in 0.00446346 secs using 86.46% CPU
Alignment most likely contains DNA/RNA sequences
Alignment has 215 sequences with 658 columns, 524 distinct patterns
344 parsimony-informative, 162 singleton sites, 152 constant sites
             Gap/Ambiguity  Composition  p-value
Analyzing sequences: done in 0.000290618 secs using 481.4% CPU
   1  HM902392.1     0.00%

## **Step 3.5 Rename the species**

This step is done in the "Downloading_data" Colab notebook.
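The renaming itself lives in that other notebook and is not shown here. As a rough idea of what such a step can look like, the sketch below replaces accession numbers in a Newick string with species names. The function `rename_leaves`, the toy tree and the mapping are all hypothetical. Names use underscores because spaces are not safe in unquoted Newick labels.

```python
import re


def rename_leaves(newick, names):
    """Replace leaf labels found in `names`; all other labels stay unchanged."""
    # A leaf label starts right after '(' or ',' and runs up to ':', ',', ')' or ';'.
    # Internal node labels (e.g. bootstrap supports) follow ')' and are not matched.
    pattern = re.compile(r"(?<=[(,])([^():,;]+)")
    return pattern.sub(lambda m: names.get(m.group(1), m.group(1)), newick)


# Toy tree with made-up branch lengths and one made-up accession.
toy_tree = "((DQ078424.1:0.1,XX000001.1:0.2)95:0.05,HM902392.1:0.3);"
toy_names = {"DQ078424.1": "Melinaea_marsaeus"}

print(rename_leaves(toy_tree, toy_names))
```

Labels without an entry in the mapping are kept as they are, so a partial mapping does not break the tree.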

## **Step 4. Tree visualisation**

We are going to use [iTOL](https://itol.embl.de/) to do it.