# ***<font color="RoyalBlue">SNP Analysis Additional Processing (SNPclipper/SNPcaster_gubbins) Notebook</font>***

<br>
<br>

***

# ***<u><font color="RoyalBlue">Analysis Procedure</font></u>***

<font color="DarkOrange">\* This notebook explains how to run additional analyses on the results of a previous SNPcaster run under different conditions.</font><br>
<font color="DarkOrange">\* If you have not yet run SNPcaster, run <b>SNPcaster_quickstart.ipynb</b> or <b>SNPcaster.ipynb</b> first.</font>

## 1. Run SNPcaster

Please execute <b>SNPcaster_quickstart.ipynb</b> or <b>SNPcaster.ipynb</b> first if you have not yet run SNPcaster.

## 2. Select the program to run

From the list below, select the analysis you want to perform and go to the corresponding section.<br>
#### - To see results after removing the recombination regions detected by Gubbins → <font color="Lime"><b>Go to 3</b></font><br>
#### - To see results after changing the clustered-SNP or masking conditions → <font color="Lime"><b>Go to 4</b></font>

## 3. Additional execution of Gubbins

### 3-1. How to run

To run Gubbins, specify a SNPcaster output folder that has not yet been processed by Gubbins.

#### Options

    *Optional. If not specified, default values will be used.

|Parameter|Description|
|---|---|
|-i|SNPcaster output folder on which to run Gubbins|
|-t*|Number of threads (default is 8)|

<font color="Tomato">────────────── 　↓↓↓ 　***Command Execution***　 ↓↓↓　 ──────────────</font>

In [None]:
####################################################
# Set parameters
####################################################
# SNPcaster output folder
input = 'snpcaster_20240115_094704_list_test'
# Number of threads
threads = 8

####################################################
# Run snpcaster_gubbins.sh
####################################################
!bash snpcaster_gubbins.sh -i $input -t $threads
!echo 'Complete!'

<font color="Tomato">────────────────────────────────────────────</font>
<br>

### 3-2. Output

The specified folder will contain the Gubbins-processed results: <b>4_results_gubbins</b> and <b>5_results_with_gubbins</b>.<br>
\* For details of the output files, see <b>SNPcaster.ipynb section 9.2</b>.<br>
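
After the run, it can be useful to confirm that both result folders were actually created. The following is a minimal sketch, assuming the notebook's working directory contains the SNPcaster output folder. The helper `missing_gubbins_outputs` is hypothetical and not part of SNPcaster; only the two folder names come from the text above.

```python
from pathlib import Path

# Folder names taken from the text above; the helper itself is hypothetical.
EXPECTED_GUBBINS_DIRS = ("4_results_gubbins", "5_results_with_gubbins")

def missing_gubbins_outputs(output_dir):
    """Return the expected Gubbins result folders that are absent from output_dir."""
    base = Path(output_dir)
    return [name for name in EXPECTED_GUBBINS_DIRS if not (base / name).is_dir()]

# Example usage (folder name as used in the cell above):
# missing = missing_gubbins_outputs('snpcaster_20240115_094704_list_test')
# print("All Gubbins outputs present" if not missing else f"Missing: {missing}")
```

An empty list means both folders exist; anything else suggests the run did not finish and the notebook output should be checked.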

## 4. Run SNPclipper

SNPclipper reprocesses the results of a previous SNPcaster run with different clustered-SNP or masking conditions.
<br>
Specifically, it regenerates <b>3_results_without_gubbins</b> and <b>5_results_with_gubbins</b>.

### 4-1. How to run

To run SNPclipper, specify the SNPcaster output folder and the clustered-SNP or masking conditions.

#### Options

    *Optional. If not specified, default values will be used.
    * "Clustered SNPs" are closely spaced SNPs; they are treated as likely recombination regions and removed from the SNP set.

|Parameter|Description|
|---|---|
|-i|SNPcaster output folder|
|-c*|Interval between adjacent SNPs used to identify clustered SNPs for deletion (0 means no deletion; default is 0)|
|-d*|Mask file (if not specified, no masking is performed)|
|-t*|Number of threads (default is 8)|
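
To make the `-c` option more concrete, here is a minimal sketch of one plausible reading of "interval of adjacent SNPs": a SNP is dropped when another SNP lies within `interval` positions of it. This is an assumption made for illustration only. The function `drop_clustered_snps` is hypothetical, and the rule SNPclipper actually applies (for example, whether the boundary is inclusive) may differ.

```python
def drop_clustered_snps(positions, interval):
    """Keep only SNP positions with no neighbouring SNP within `interval`.

    interval <= 0 disables the filter, mirroring the documented default of 0.
    ASSUMPTION: the boundary is inclusive (distance <= interval is "clustered").
    """
    pos = sorted(positions)
    if interval <= 0:
        return pos
    kept = []
    for i, p in enumerate(pos):
        near_prev = i > 0 and p - pos[i - 1] <= interval
        near_next = i + 1 < len(pos) and pos[i + 1] - p <= interval
        if not (near_prev or near_next):
            kept.append(p)
    return kept

# 100/105 and 900/903 are each within 10 of a neighbour, so they are dropped.
print(drop_clustered_snps([100, 105, 400, 900, 903, 2000], interval=10))  # prints [400, 2000]
```

Under this reading, raising the interval removes more SNPs, so it is worth comparing the resulting SNP counts across a few settings.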

<font color="Tomato">────────────── 　↓↓↓ 　***Command Execution***　 ↓↓↓　 ──────────────</font>

In [None]:
####################################################
# Set parameters
####################################################
# SNPcaster output folder
input = 'snpcaster_20240115_094704_list_test'
# Interval of adjacent SNPs (clustered SNP) to delete
cluster = 0
# Number of threads
threads = 8
# To apply a mask, uncomment the line below and put the mask file path between the quotes
#mask = "mask.tsv"

####################################################
# Run SNPclipper (do not modify the code below)
####################################################
extra_options = ""
extra_options += f"-d {mask} " if 'mask' in locals() and mask else ""

!bash snpclipper.sh \
    -i $input \
    -c $cluster \
    -t $threads $extra_options
!echo 'Complete!'

<font color="Tomato">────────────────────────────────────────────</font>
<br>

### 4-2. Output

A folder named <b>snpclipper_[execution datetime]</b> will be created in the specified folder.
It will contain <b>3_results_without_gubbins</b> and <b>5_results_with_gubbins</b>, processed under the specified conditions.<br>
\* <b>5_results_with_gubbins</b> is generated only if Gubbins has also been run.<br>
\* To run Gubbins as well, follow <b>3. Additional execution of Gubbins</b>.<br>
\* For details of the output files, see <b>SNPcaster.ipynb section 9.2</b>.<br>
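
Because each SNPclipper run creates a new `snpclipper_*` folder, several may accumulate in one output folder. The sketch below picks the most recently modified one. The helper `latest_snpclipper_run` is hypothetical, and it relies on modification time rather than the folder name because the exact datetime format is not stated here.

```python
from pathlib import Path

def latest_snpclipper_run(output_dir):
    """Return the most recently modified snpclipper_* folder, or None if none exists."""
    runs = [p for p in Path(output_dir).glob("snpclipper_*") if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)

# Example usage (folder name as used in the cell above):
# run_dir = latest_snpclipper_run('snpcaster_20240115_094704_list_test')
# print(run_dir)
```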

## 5. Analyses using SNPs

### 5.1. Creating a Haplotype Network Diagram

Open one of the following files with PopART for analysis.
<br>
- 3_results_without_gubbins
    - final_snp.nex
    - final_snp_woRef.nex
- 5_results_with_gubbins
    - final_snp_with_gubbins.nex
    - final_snp_with_gubbins_woRef.nex
<br>
\* Use the files with "woRef" when you want a diagram without the reference sequence.
<br>

### 5.2. Building a Phylogenetic Tree

Use one of the following files (the files under 5_results_with_gubbins are generated only when Gubbins has been run).
- 3_results_without_gubbins
    - final_snp.fasta
    - final_snp_woRef.fasta
- 5_results_with_gubbins
    - final_snp_with_gubbins.fasta
    - final_snp_with_gubbins_woRef.fasta

#### Use RAxML-NG → <font color="Lime"><b>Go to 6.1.1.</b></font>
#### Use IQTREE → <font color="Lime"><b>Go to 6.1.2.</b></font>

### 5.3. Calculating SNPs Between Strains

The pairwise SNP counts between strains are listed in the following files, which can be opened with Notepad, Excel, etc.
- 3_results_without_gubbins
    - dist_final_snp.tsv
    - dist_final_snp_matrix.tsv
- 5_results_with_gubbins
    - dist_final_snp_without_recombination.tsv
    - dist_final_snp_matrix_without_recombination.tsv
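
The matrix files can also be read programmatically, for example to list strain pairs that differ by only a few SNPs. The sketch below is illustrative and rests on assumptions about the layout that should be checked against a real file: tab-separated values, a header row of strain names (with a corner cell first), strain names in the first column, and integer SNP counts. The function `close_pairs` is hypothetical.

```python
import csv

def close_pairs(matrix_path, max_snps):
    """Return (strain_a, strain_b, distance) for pairs at most max_snps apart.

    ASSUMED layout: square tab-separated matrix with a header row and a
    first column of strain names; only the upper triangle is read.
    """
    with open(matrix_path, newline="") as handle:
        rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    names = rows[0][1:]  # skip the corner cell of the header row
    pairs = []
    for i, row in enumerate(rows[1:]):
        for j in range(i + 1, len(names)):
            dist = int(row[j + 1])  # +1 skips the strain-name column
            if dist <= max_snps:
                pairs.append((row[0], names[j], dist))
    return pairs

# Example usage (path is illustrative):
# close_pairs('3_results_without_gubbins/dist_final_snp_matrix.tsv', max_snps=5)
```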

## 6. Building a Phylogenetic Tree

Using the obtained SNP files, construct a maximum-likelihood phylogenetic tree with either RAxML-NG or IQTREE.<br>
* Change the name of the reference sequence (default is Ref) if necessary.
* When building a tree without the reference sequence, use the files with "woRef".
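
If changing the reference name means editing the sequence header inside the FASTA file, one way to do it is sketched below. This is an assumption about what the step involves. The function `rename_fasta_record` and the replacement name are hypothetical, and the original file is left untouched.

```python
def rename_fasta_record(fasta_in, fasta_out, old_name="Ref", new_name="Reference"):
    """Copy a FASTA file, renaming records whose ID is exactly old_name.

    Returns the number of headers that were renamed.
    """
    renamed = 0
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                header = line[1:].rstrip("\r\n")
                rest = header[len(old_name):]
                # Match the whole ID only, so "Ref2" is not touched.
                if header.startswith(old_name) and (rest == "" or rest[0].isspace()):
                    header = new_name + rest
                    renamed += 1
                dst.write(f">{header}\n")
            else:
                dst.write(line)
    return renamed

# Example usage (paths and the new name are illustrative):
# rename_fasta_record('final_snp.fasta', 'final_snp_renamed.fasta', 'Ref', 'Reference')
```

A return value of 0 means no record named `Ref` was found, which is expected for the "woRef" files.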

### 6.1. Maximum-Likelihood Tree Construction

\* Both RAxML-NG and IQTREE can be used.

#### 6.1.1 RAxML-NG

ModelTest-NG selects the optimal nucleotide substitution model, which is then used by RAxML-NG to build the maximum-likelihood tree.

- In the cell below, enter the path to the SNP file after "input=".
- Example: 'snpcaster_20240115_094704_list.txt/3_results_without_gubbins/final_snp.fasta'<br>
    ↑<u><font color="Red">The "20240115_094704_list.txt" part varies with each analysis.</font></u>

- Options

|Parameter|Required|Description|
|---|---|---|
|input|●|Input file (e.g., xxx.fasta).|
|threads|●|Number of threads for ModelTest-NG (e.g., 12). RAxML-NG selects its thread count automatically.|
|bootstrap|-|Number of bootstrap replicates (default: 1,000)|

<font color="Tomato">────────────── 　↓↓↓ 　***Command Execution***　 ↓↓↓　 ──────────────</font><br>
\* The log is written to `raxml-ng.log` in the execution directory and is overwritten on each run.

In [None]:
####################################################
# Set parameters
####################################################
input='snpcaster_20240115_094704_list.txt/3_results_without_gubbins/final_snp.fasta'
# To use the results from a Gubbins run, uncomment and edit the line below.
# input = "snpcaster_20240115_094704_list.txt/5_results_with_gubbins/final_snp_with_gubbins.fasta"
threads=8
bootstrap=1000

####################################################
# Run raxml-ng
####################################################
!bash raxml-ng.sh $input $threads $bootstrap > raxml-ng.log
!echo 'Complete!'

<font color="Tomato">────────────────────────────────────────────</font>

#### 6.1.2 IQTREE

IQTREE automatically selects the nucleotide substitution model and builds a maximum-likelihood tree using the optimal model.

- In the cell below, enter the path to the SNP file after "input=".
- Example: 'snpcaster_20240115_094704_list.txt/3_results_without_gubbins/final_snp.fasta'<br>
    ↑<u><font color="Red">The "20240115_094704_list.txt" part varies with each analysis.</font></u>

- Options

|Parameter|Required|Description|
|---|---|---|
|input|●|Input file (e.g., xxx.fasta).|
|bootstrap|-|Number of ultrafast bootstrap replicates (at least 1,000; default: 1,000)|

- Output
  The generated iqtree folder (`iqtree_results_<date>_<time>`) will contain the results.
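
Before starting a long tree-building run with either tool, it can help to confirm that the input file exists and looks like an aligned FASTA file. The following is a minimal sketch with a hypothetical helper, `check_alignment`. It checks only that the file exists, that records begin with `>`, and that all sequences have the same length; it does not validate the nucleotide characters.

```python
from pathlib import Path

def check_alignment(fasta_path):
    """Return (n_sequences, alignment_length) or raise ValueError if unusable."""
    path = Path(fasta_path)
    if not path.is_file():
        raise ValueError(f"Input file not found: {path}")
    lengths = []  # one running length per FASTA record
    with path.open() as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                lengths.append(0)  # a header starts a new record
            elif not lengths:
                raise ValueError("Sequence data found before the first '>' header")
            else:
                lengths[-1] += len(line)
    if not lengths:
        raise ValueError("No FASTA records found")
    if len(set(lengths)) != 1:
        raise ValueError("Sequences differ in length, so this is not an alignment")
    return len(lengths), lengths[0]

# Example usage (path as in the examples above):
# n_seq, n_sites = check_alignment(
#     'snpcaster_20240115_094704_list.txt/3_results_without_gubbins/final_snp.fasta')
```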

<font color="Tomato">────────────── 　↓↓↓ 　***Command Execution***　 ↓↓↓　 ──────────────</font><br>
\* Verify that the `input` variable points to a valid file path.<br>
\* The log is written to `iqtree.log` in the execution directory and is overwritten on each run.

In [None]:
####################################################
# Set parameters
####################################################
input='snpcaster_20240115_094704_list.txt/3_results_without_gubbins/final_snp.fasta'
# To use the results from a Gubbins run, uncomment and edit the line below.
# input = "snpcaster_20240115_094704_list.txt/5_results_with_gubbins/final_snp_with_gubbins.fasta"
bootstrap=1000

####################################################
# Run iqtree
####################################################
!bash iqtree.sh $input $bootstrap > iqtree.log
!echo 'Complete!'

<font color="Tomato">────────────────────────────────────────────</font>

### 6.2. Output location for phylogenetic tree files

**The analysis folder** will contain the following tree result folders and files:

- RAxML-NG:
  - Folder name: `raxml_results_<date>_<time>`
  - File: `final_snp_without_recombination.fasta_bootstrap.nwk` (bootstrap-annotated tree)
- IQTREE:
  - Folder name: `iqtree_results_<date>_<time>`
  - File: `final_snp_without_recombination.fasta.contree` (bootstrap-annotated tree)
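
When several tree runs have accumulated, the result folders can be listed programmatically. The sketch below is a hypothetical helper, `find_tree_files`, that searches the analysis folder using the folder-name patterns and file extensions given above. It assumes the analysis folder is the notebook's working directory unless another path is passed.

```python
from pathlib import Path

def find_tree_files(analysis_dir="."):
    """Map each tree result folder name to the tree files found inside it."""
    # Folder patterns and extensions follow the list above.
    patterns = {"raxml_results_*": "*.nwk", "iqtree_results_*": "*.contree"}
    found = {}
    for folder_glob, file_glob in patterns.items():
        for folder in sorted(Path(analysis_dir).glob(folder_glob)):
            if folder.is_dir():
                found[folder.name] = sorted(p.name for p in folder.glob(file_glob))
    return found

# Example usage:
# for folder, files in find_tree_files().items():
#     print(folder, files)
```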

### 6.3. Verifying phylogenetic tree files

Open the tree files with MEGA, CLC Genomics Workbench, etc.<br>
In MEGA, you can save images via Image → Copy to Clipboard. To save images with the root, select the root line, click "Place Root on Branch", and use "Show Subtree Separately".
<br>