# ***<font color="RoyalBlue">Assembly and Data QC (grape_qc_assembly)</font>***

Performs coverage calculation, assembly (SPAdes or SKESA), and quality checks (QUAST, CheckM).

<br>
<br>

---

# ***<u><font color="RoyalBlue">Analysis Steps</font></u>***

## 1. Upload Input Files

Upload FASTQ files to your analysis project folder.
You can upload files from the ⬆︎ button at the top of the left sidebar, or by dragging and dropping files into the analysis project folder.
<br>


The file extension must be `fastq.gz`.<br>
The file name should begin with the strain name.<br>
For example, the following formats are supported:

- Illumina format (e.g., strain_S00_L001_R1_001.fastq.gz)
- Simple format (e.g., strain_R1.fastq.gz)
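File names can be checked against these two styles before uploading. The sketch below is only an illustration: the regular expression and the function name `strain_from_fastq` are assumptions made for this example, and the pipeline's own matching rules may differ.

```python
import re

# Hypothetical pattern covering the two naming styles listed above:
#   Illumina: <strain>_S<number>_L<lane>_R1_001.fastq.gz
#   Simple:   <strain>_R1.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<strain>.+?)"      # strain name at the start of the file name
    r"(?:_S\d+_L\d+)?"       # optional Illumina sample and lane fields
    r"_R(?P<read>[12])"      # read number (R1 or R2)
    r"(?:_001)?"             # optional Illumina suffix
    r"\.fastq\.gz$"          # required extension
)


def strain_from_fastq(file_name):
    """Return (strain, read) if the name fits either style, else None."""
    match = FASTQ_NAME.match(file_name)
    if match is None:
        return None
    return match.group("strain"), int(match.group("read"))
```

A name that returns `None` (for example, one ending in `.fastq` without `.gz`) should be renamed or recompressed before upload.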

## 2. Create a Strain List

Create a strain list using one of the following methods:

### 2.1. Create in Jupyter Lab
- Click the "+" button at the top of the left sidebar → Click "Text file" under "Other" to create a new file.
- After editing is complete, press Ctrl+s (Mac: Command+s) and save the file under a name of your choice.
### 2.2. Create with any text editor (Notepad, Vim, etc.) and copy it to the analysis folder.
- A list created on Windows may not work correctly, because Windows uses different line endings (CRLF instead of LF).
- Create the list on Linux or Mac where possible.
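If a list was created on Windows anyway, its line endings can be converted after copying it to the analysis folder. This is a minimal sketch, not part of the pipeline; the function name `normalize_line_endings` is hypothetical.

```python
from pathlib import Path


def normalize_line_endings(path):
    """Rewrite the file in place with Unix (LF) line endings."""
    data = Path(path).read_bytes()
    # Convert Windows (CRLF) endings first, then any remaining bare CR.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    Path(path).write_bytes(data)
```

For example, `normalize_line_endings("list.txt")` rewrites `list.txt` in the current folder.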
### 2.3. Create using the command below.
- After "file_name = ", enter the list name enclosed in double quotes (e.g., "filename").
- After "user_input = (""", enter the strain list (one strain name per line).

<font color="Tomato">────────────── ↓↓↓ ***Execute Command*** ↓↓↓ ──────────────</font>

In [None]:
# Enter the filename for the list
file_name = "list.txt"

# Paste the strain list
user_input = ("""
A0001
A0002
A0003
A0004
A0005
A0006
A0007
A0008
"""
)

####################################################
# **Do not modify the following**
####################################################

# Remove leading and trailing newlines, then add a newline
user_input = user_input.strip("\n") + "\n"

# Write the input elements to the file
with open(file_name, "w") as file:
    file.write(user_input)
!echo 'Complete!'

<font color="Tomato">────────────────────────────────────────────</font>
<br>

## 3. Create a FASTQ List

Create a list of FASTQ file names (short-read data file names) associated with each strain in the strain list created in step 2. Follow the procedure below.

### 3.1. Run find_strain_pairs.py
- After "file_name = ", enter the strain list name created in the previous step, enclosed in double quotes.
- In the "Set Parameters" section, modify any items as needed and execute. The cell can also be executed unchanged.

<font color="Tomato">────────────── ↓↓↓ ***Execute Command*** ↓↓↓ ──────────────</font>

In [None]:
import subprocess

####################################################
# Set Parameters
####################################################
# Enter the filename of the strain list created in step 2
file_name = "list.txt"
# Enter the file extension for FASTQ files
file_extension = "fastq.gz"
# Filename for the FASTQ list to be created
fastq_list_name = "list_fastq.tsv"
# Filename for the list of strains with no paired FASTQ files found
unpaired_list = "unpaired_fastq.tsv"

####################################################
# Run find_strain_pairs.py
# **Do not modify the following**
####################################################
command = [
    "find_strain_pairs.py", file_name,
    "--file_extension", file_extension,
    "--paired_list", fastq_list_name,
    "--unpaired_list", unpaired_list
]
subprocess.run(command, capture_output=False, text=True)
!echo 'Complete!'

<font color="Tomato">────────────────────────────────────────────</font>
<br>

### 3.2. Verify and Modify the FASTQ List
- Open and verify `list_fastq.tsv` (FASTQ list) to confirm that each strain name has two FASTQ files (Read1 and Read2) correctly listed.
- If `unpaired_fastq.tsv` is generated, open and verify it to identify strains where FASTQ file pairs were not found.
    - Copy the strain name to `list_fastq.tsv` and add the paired FASTQ files separated by tabs.
- Below is an example of the FASTQ list format (columns are tab-separated, line endings are \n).

```plaintext
strain1	strain1_R1.fastq.gz	strain1_R2.fastq.gz
strain2	strain2_R1.fastq.gz	strain2_R2.fastq.gz
```
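The format above can be checked programmatically before running the assembly. The following sketch assumes exactly three tab-separated columns per line, as in the example, and that the FASTQ files sit in the folder passed as `fastq_dir`; the function name `check_fastq_list` is hypothetical.

```python
from pathlib import Path


def check_fastq_list(list_path, fastq_dir="."):
    """Return a list of human-readable problems found in the FASTQ list."""
    problems = []
    # Read raw bytes so that carriage returns are not silently translated.
    text = Path(list_path).read_bytes().decode("utf-8")
    if "\r" in text:
        problems.append("file contains carriage returns (Windows line endings)")
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append(
                f"line {line_no}: expected 3 tab-separated columns, found {len(fields)}"
            )
            continue
        strain, read1, read2 = fields
        for fastq in (read1, read2):
            if not (Path(fastq_dir) / fastq).is_file():
                problems.append(f"line {line_no}: {fastq} not found for {strain}")
    return problems
```

An empty result from `check_fastq_list("list_fastq.tsv")` means every line has a strain name and two FASTQ files that exist.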

## 4. Create a Configuration File

<b>If QC is not necessary, proceed to step 5.</b><br>
<br>

To perform QC, create a `configuration.txt` file as follows:<br>

A sample configuration for STEC, `stec/configuration.txt`, is provided in the `sample_data` folder, which sits in the same folder as the `CreateProject.ipynb` notebook used in Create Project. Copy it to your project folder and edit it to suit the bacterial species you are analyzing.<br>

In `configuration.txt`, you specify the genome size and the required coverage, which are used to judge whether the amount of data is sufficient.<br>
Genome size varies by bacterial species, so set it appropriately.<br>
Delete any item that does not need checking; no check is performed for deleted items.<br>
<br>
Below is an example for STEC.<br>
<br>
Genome size (Mbp):&nbsp;&nbsp;5.5 <br>
Coverage threshold:&nbsp;&nbsp;40 <br>
Maximum number of contigs:&nbsp;&nbsp;1000 <br>
Minimum genome size (Mbp):&nbsp;&nbsp;4 <br>
Maximum genome size (Mbp):&nbsp;&nbsp;6.5 <br>
Minimum completeness (%):&nbsp;&nbsp;98 <br>
Maximum contamination (%):&nbsp;&nbsp;2 <br>
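As a rough guide to how genome size and the coverage threshold interact, coverage is conventionally estimated as the total number of sequenced bases divided by the genome size. The sketch below assumes that definition; the exact calculation used by the pipeline is not described here, and both function names are illustrative.

```python
def estimated_coverage(total_bases, genome_size_mbp):
    """Coverage (x) = total sequenced bases / genome size in bases."""
    return total_bases / (genome_size_mbp * 1_000_000)


def min_bases_required(genome_size_mbp, coverage_threshold):
    """Smallest number of sequenced bases that reaches the threshold."""
    return genome_size_mbp * 1_000_000 * coverage_threshold
```

With the STEC values above (genome size 5.5 Mbp, coverage threshold 40), `min_bases_required(5.5, 40)` gives 220,000,000 bases of read data.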

## 5. Run qc_assembly (Data Check and Assembly)

Choose 5.1 or 5.2 according to the amount of memory available on your machine.
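If the amount of memory is not known, it can be checked from the notebook. This sketch assumes a Linux (or similar Unix) machine where `os.sysconf` exposes the page size and page count; the function names and the use of 40 as a default threshold are illustrative only.

```python
import os


def total_memory_gb():
    """Total physical memory in GB (assumes a Linux-like system)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 1024**3


def recommended_section(memory_gb, threshold_gb=40):
    """Return the section to follow for the given amount of memory."""
    return "5.1" if memory_gb >= threshold_gb else "5.2"
```

For example, `recommended_section(total_memory_gb())` returns `"5.2"` on a machine with 16 GB of memory.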

### 5.1. When Machine Memory is 40GB or More

In the cell below, replace `list_fastq.tsv` in `list = "list_fastq.tsv"` with the name of the FASTQ list you created, then execute the cell.

|Variable|Parameter|Required|Description|
|---|---|---|---|
|list|-i|●|The filename of the FASTQ list created in step 3.|
|assembler|-a|-|Assembler to use [p:SPAdes, k:SKESA] (p by default)|
|config|-c|-|Configuration file for quality thresholds during grape analysis. If not specified, quality assessment will not be performed.|
|cleanup|-u|-|Whether to delete intermediate files. [1: cleanup, 0: not cleanup] (1 by default)|
|cutoff length|-L|-|Remove contigs shorter than the specified length. (500 by default)|
|fastp|-p|-|Whether to perform quality trimming with fastp (remove low-quality bases). [1:run, 0:not run] (1 by default)|
|scaffold|-s|-|Whether to create scaffold sequences. Only effective when SPAdes is specified as the assembler. [1: create, 0: not create] (0 by default)|
|threads|-t|-|Number of threads (8 by default)|

<br>

<font color="Tomato">────────────── ↓↓↓ ***Execute Command*** ↓↓↓ ──────────────</font><br>
\* For log information, refer to `grape_qc_assembly.log` in the execution directory. (<u><font color="Red">***This file is overwritten each time you run.***</font></u>)

In [None]:
####################################################
# Set Parameters
####################################################
# FASTQ list (created in step 3)
list = "list_fastq.tsv"
# Configuration file
# If you want to perform quality assessment, create a quality configuration file and specify its filename below.
config = "" 
# config = "configuration.txt"
# Assembler: Enter "p" to use SPAdes, "k" to use SKESA
assembler = "p"
# Contig threshold: Contigs shorter than this value (bp) will be removed
cutoff = 500
# Enter "1" to perform trimming with fastp, "0" to skip
fastp = 1
# Enter "1" to create scaffolds, "0" to skip
scaffold = 0
# Number of threads
threads = 8
# Enter "0" to keep intermediate files like trimmed fastq, "1" to delete them
cleanup = 1

####################################################
# Run grape_qc_assembly (Do not modify)
####################################################
# Run grape_qc_assembly
config_option = f"-c {config} " if config else ""

!bash grape_qc_assembly.sh \
    -i $list \
    -a $assembler \
    -L $cutoff \
    -p $fastp \
    -s $scaffold \
    -t $threads \
    -u $cleanup $config_option\
    > grape_qc_assembly.log
!echo 'Complete!'

### 5.2. When Machine Memory is Less Than 40GB

In the cell below, replace `list_fastq.tsv` in `list = "list_fastq.tsv"` with the name of the FASTQ list you created, then execute the cell.<br>
\* For details on each parameter, see `5.1. When Machine Memory is 40GB or More`.

In [None]:
####################################################
# Set Parameters
####################################################
# FASTQ list (created in step 3)
list = "list_fastq.tsv"
# Configuration file
# If you want to perform quality assessment, create a quality configuration file and specify its filename below.
config = "" 
# config = "configuration.txt"
# Assembler: Enter "p" to use SPAdes, "k" to use SKESA
assembler = "p"
# Contig threshold: Contigs shorter than this value (bp) will be removed
cutoff = 500
# Enter "1" to perform trimming with fastp, "0" to skip
fastp = 1
# Enter "1" to create scaffolds, "0" to skip
scaffold = 0
# Number of threads
threads = 8
# Enter "0" to keep intermediate files like trimmed fastq, "1" to delete them
cleanup = 1

####################################################
# Run grape_qc_assembly (Do not modify)
####################################################
# Run grape_qc_assembly
config_option = f"-c {config} " if config else ""

!bash grape_qc_assembly.sh \
    -i $list \
    -a $assembler \
    -L $cutoff \
    -p $fastp \
    -s $scaffold \
    -t $threads \
    -u $cleanup $config_option \
    -r 1 \
    > grape_qc_assembly.log
!echo 'Complete!'

<font color="Tomato">────────────────────────────────────────────</font>

<br>

### 5.3. Output Files

After execution, a folder named `qc_assembly_[date]_[time]_[list_name]` is created. It contains the following files:<br>

- [strain_name].fasta: Draft genome
- assembly_summary.tsv: Summary of statistics such as contig lengths (calculated by QUAST).
- coverage.txt: Coverage list
- list_assembly.tsv: FASTQ file list for strains included in list_over_coverage.
- list_over_coverage: List of strains that exceeded the coverage specified in the configuration file. Only strains in this list will be assembled.
- qc_results.tsv: Results from QUAST and CheckM (genome size, contamination, etc.). Records "pass" if within the configuration file range, "fail" if outside. The qc_results (overall assessment) column shows "PASS" only when all values pass.
- qc_results.xlsx: Excel version of qc_results.tsv (same content).
- The following data is retained only when cleanup=0:
  - fastp folder: FASTQ files trimmed by fastp
    - [strain_name]_1.fastq.gz and [strain_name]_2.fastq.gz: Read1 and Read2 files containing reads whose pair was found
    - [strain_name]_u1.fastq.gz and [strain_name]_u2.fastq.gz: Read1 and Read2 files containing reads whose pair was not found
  - quast folder: QUAST execution results
    - For details of output, see [https://github.com/ablab/quast#output](https://github.com/ablab/quast#output).
  - checkm folder: CheckM execution results
    - For details of output, see [https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM).

Generally, only data that passes QC is used for subsequent analysis (SNP analysis, etc.).<br>
However, data that fails a quality check is not necessarily unusable. If there is no obvious contamination or similar problem, examine the data carefully before deciding whether to exclude it.
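To find the strains that need a closer look, the overall assessment column of `qc_results.tsv` can be scanned. The sketch below assumes that the file has a header row, that the overall column is named `qc_results` as described above, and that the first column holds the strain name; the last two points and the function name `failed_strains` are assumptions, so check them against an actual output file.

```python
import csv


def failed_strains(qc_path, overall_column="qc_results"):
    """Return the first-column value of every row whose overall result is not PASS."""
    failed = []
    with open(qc_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Assumption: the first column holds the strain name.
        strain_column = reader.fieldnames[0]
        for row in reader:
            if row[overall_column] != "PASS":
                failed.append(row[strain_column])
    return failed
```

The returned names can then be compared with the individual pass/fail columns to see which threshold each strain missed.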