## NGS alignment exercise

WXS_example_* and WXS_example_cancer_* are next-generation sequencing (NGS) data from the blood and the cancer tissue, respectively, of a colorectal cancer patient.

Next week, we will learn to call single nucleotide variants from these files. This week, our objective is only to align them. As you work through the alignment, answer the following questions and record the Linux commands you use in a separate document.

(1)	Why are there two files for each sample (i.e. WXS_example_1.fq.gz and WXS_example_2.fq.gz)?

(2)	How long are the reads in WXS_example_1.fq.gz?

(3)	How many reads are in each sample?

(4)	Now align the file. How many properly paired reads were aligned in WXS_example and WXS_example_cancer, respectively?

(5)	View the files in IGV. What regions of DNA are most reads covering?

(6)	What is the coverage at the transcription start site of MSH6?


First, set up a Colab session for the exercise.

In [1]:
# Set the working directory to a folder in your own Google Drive (~ 1 min)
from google.colab import drive
drive.mount('/content/gdrive')                         # the first time you run this, you will be asked to grant permission to link your Google Drive

import os
try:
  os.mkdir("/content/gdrive/My Drive/PB_course")         # change this path if necessary
except FileExistsError:
  print("directory already exists. OK to continue")
os.chdir("/content/gdrive/My Drive/PB_course")

Mounted at /content/gdrive
directory already exists. OK to continue


In [2]:
# Install conda (~ 1 min). Colab will report that the session has crashed; this is expected, because the kernel restarts after conda is installed
!pip install -q condacolab
import condacolab
condacolab.install()

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:15
🔁 Restarting kernel...


In [None]:
# Install samtools (~1 min)
!conda install -c bioconda samtools

In [None]:
# Install bwa (~ 1 min)
!conda install -c bioconda bwa

In [None]:
# Install igv-notebook (< 1 min)
!pip install igv-notebook

In [None]:
# Download the reference sequence (~ 30 s)
# First make sure that we are in the right directory
import os
os.chdir("/content/gdrive/MyDrive/PB_course")                     # change this path if necessary

if os.path.isfile("/content/gdrive/MyDrive/PB_course/DB_trunc/chr2.fa"):    # check whether the file already exists
  print("reference file already exists, OK to continue.")
else:
  !pip install gdown
  !gdown https://drive.google.com/uc?id=1aRJVznjy5WLQ5Dc0DT9c6NiXw64HdoKr # download the file if it does not exist
  # unzip the archive containing the FASTA file
  !unzip DB_trunc.zip
  # remove the zip file after extraction
  !rm DB_trunc.zip

!ls -l ./DB_trunc/

In [1]:
# Run this cell to download the WXS files
import os
os.chdir("/content/gdrive/MyDrive/PB_course")

if os.path.isfile("/content/gdrive/MyDrive/PB_course/Datasets/WXS_example_cancer_1.fq.gz"):    # check whether the file already exists
  print("file already exists, OK to continue.")
else:
  !wget -O Datasets_GXS.zip https://github.com/jasonwong-lab/HKU-Practical-Bioinformatics/raw/main/files/Datasets_GXS.zip
  !unzip -o Datasets_GXS.zip   # unzip the archive
  !rm Datasets_GXS.zip
  !wget -O Datasets/WXS_example_cancer_1.fq.gz https://github.com/jasonwong-lab/HKU-Practical-Bioinformatics/raw/main/files/WXS_example_cancer_1.fq.gz
  !wget -O Datasets/WXS_example_cancer_2.fq.gz https://github.com/jasonwong-lab/HKU-Practical-Bioinformatics/raw/main/files/WXS_example_cancer_2.fq.gz


# Check what files we have now
%cd /content/gdrive/MyDrive/PB_course/Datasets/
!ls -l

--2024-10-25 18:13:29--  https://github.com/jasonwong-lab/HKU-Practical-Bioinformatics/raw/main/files/Datasets_GXS.zip
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/jasonwong-lab/HKU-Practical-Bioinformatics/main/files/Datasets_GXS.zip [following]
--2024-10-25 18:13:29--  https://raw.githubusercontent.com/jasonwong-lab/HKU-Practical-Bioinformatics/main/files/Datasets_GXS.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24016057 (23M) [application/zip]
Saving to: ‘Datasets_GXS.zip’


2024-10-25 18:13:32 (38.3 MB/s) - ‘Datasets_GXS.zip’ saved [24016057/24016057]

Archive:  Datasets_GXS.zip


You can now start the exercise.
The answers are on Moodle.
Type in the code yourself to get familiar with Colab.

**Q1 Why are there two files for each sample (i.e. WXS_example_1.fq.gz and WXS_example_2.fq.gz)?**

**Q2 How long are the reads in WXS_example_1.fq.gz?**  
Hint: zcat
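
The hint points at `zcat`, which decompresses a gzipped FASTQ file to standard output. As a minimal sketch of the same idea in Python, the block below writes a tiny made-up FASTQ file (`toy_1.fq.gz`, a hypothetical name, not one of the course files) and measures the sequence length of its first records. It assumes the usual layout of four lines per record: header, sequence, separator, and quality string.

```python
import gzip
import itertools
import os
import tempfile

# Two made-up FASTQ records (4 lines each); not taken from the course data.
TOY_FASTQ = (
    "@read1/1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@read2/1\nTTGCAATGCC\n+\nIIIIIIIIII\n"
)

def read_lengths(path, n_records=2):
    """Return the sequence lengths of the first n_records in a gzipped FASTQ."""
    lengths = []
    with gzip.open(path, "rt") as handle:
        # Read only the first 4 * n_records lines, much like `zcat file | head`.
        for i, line in enumerate(itertools.islice(handle, 4 * n_records)):
            if i % 4 == 1:  # the second line of each record is the sequence
                lengths.append(len(line.rstrip("\n")))
    return lengths

# Hypothetical toy file written to a temporary directory.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_1.fq.gz")
with gzip.open(toy_path, "wt") as out:
    out.write(TOY_FASTQ)

toy_lengths = read_lengths(toy_path)
print(toy_lengths)
```

Looking at only a few records is quick, but it does not prove that every read in the file has the same length.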

**Q3 How many reads are in each sample?**  
Hint: zcat + wc -l
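
Counting lines works because each FASTQ record normally occupies exactly four lines, so the number of reads in one file is the line count divided by four. The sketch below illustrates this on made-up data; `count_reads` and the toy file name are hypothetical helpers for illustration, and the four-lines-per-record layout is an assumption that the function checks only loosely.

```python
import gzip
import os
import tempfile

def count_reads(path):
    """Count reads in a gzipped FASTQ as (number of lines) / 4."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A line count that is not a multiple of 4 suggests a truncated file.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4

# Made-up data: three records, not taken from the course files.
toy_records = "".join(
    f"@read{i}/1\nACGTACGT\n+\nIIIIIIII\n" for i in range(3)
)
toy_counts_path = os.path.join(tempfile.mkdtemp(), "toy_counts_1.fq.gz")
with gzip.open(toy_counts_path, "wt") as out:
    out.write(toy_records)

toy_count = count_reads(toy_counts_path)
print(toy_count)
```

For a paired-end sample, the `_1` and `_2` files are expected to hold the same number of records, so counting both is a useful sanity check.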

**Q4 Now align the file. How many properly paired reads were aligned in WXS_example and WXS_example_cancer, respectively?**  
Hint: bwa mem + samtools flagstat
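
One possible shape of the workflow behind this hint is sketched below as a small Python helper that only builds the shell commands, without running them. The reference path `DB_trunc/chr2.fa` and the FASTQ names come from the cells above; the output prefix, the `.sorted.bam` suffix, and the helper name `alignment_commands` are assumptions made for illustration. The sketch also assumes the reference has not been indexed yet; if the `bwa index` files already exist, the first command can be skipped.

```python
import shlex

def alignment_commands(ref, fq1, fq2, prefix):
    """Build shell commands for a basic paired-end alignment workflow."""
    ref, fq1, fq2 = (shlex.quote(p) for p in (ref, fq1, fq2))
    bam = shlex.quote(f"{prefix}.sorted.bam")  # hypothetical output name
    return [
        f"bwa index {ref}",                                     # index the reference (once)
        f"bwa mem {ref} {fq1} {fq2} | samtools sort -o {bam} -",  # align, then coordinate-sort
        f"samtools index {bam}",                                # write the .bai index
        f"samtools flagstat {bam}",                             # summary incl. "properly paired"
    ]

cmds = alignment_commands(
    "DB_trunc/chr2.fa",
    "Datasets/WXS_example_1.fq.gz",
    "Datasets/WXS_example_2.fq.gz",
    "WXS_example",
)
for cmd in cmds:
    print(cmd)
```

In Colab, each printed command could be run in a cell with a leading `!`, or passed to `subprocess.run(cmd, shell=True, check=True)`. The same helper can be called again with the `WXS_example_cancer` files to process the second sample.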

**Q5 View the files in IGV. What regions of DNA are most reads covering?**

In [None]:
# First prepare files for IGV


In [None]:
# Then load the tracks in IGV
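
As a hedged sketch of what loading a track involves, the block below builds the kind of track dictionary that the `igv-notebook` README uses for local BAM files. The BAM file names are hypothetical (they assume sorted, indexed BAMs from the alignment step), the helper `bam_track_config` is an illustration rather than part of the package, and the block deliberately stops short of calling the package itself.

```python
def bam_track_config(name, bam_path):
    """Build a track dictionary for a sorted, indexed BAM file."""
    return {
        "name": name,
        "path": bam_path,
        "indexPath": bam_path + ".bai",  # index file written by `samtools index`
        "format": "bam",
        "type": "alignment",
    }

tracks = [
    bam_track_config("blood", "WXS_example.sorted.bam"),          # hypothetical file name
    bam_track_config("cancer", "WXS_example_cancer.sorted.bam"),  # hypothetical file name
]
for track in tracks:
    print(track)
```

In the notebook, after `igv_notebook.init()`, each dictionary would be passed to `load_track` on an `igv_notebook.Browser` object. The browser needs a genome that matches the build of the reference in DB_trunc, which this section does not state, and the package README should be checked for how local paths are resolved in Colab.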

**Q6 What is the coverage at the transcription start site of MSH6?**  
Hint: use the IGV app to find the answer