nabiafshan/tutorials

What in the world is a FASTQ file?

FASTQ files contain unmapped sequence reads. These are the initial files we get back from a sequencing facility.

For each sequence read the FASTQ file contains 4 lines of information:

| Line | Meaning |
| --- | --- |
| 1 | Begins with @ and contains information about the read (like a unique identifier, or a descriptor like sequence length) |
| 2 | The nucleotide sequence of the read |
| 3 | Begins with + and sometimes repeats the information from line 1 |
| 4 | A string of characters (the same length as line 2) that represents the quality of each base (the probability that the corresponding nucleotide in line 2 is correct) |

Let's check our toy example and look for this information.
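To make the four-line layout concrete, here is a minimal Python sketch that splits a FASTQ stream into records. It assumes the simple, common case where every record is exactly four lines (no wrapped sequences). The names `FastqRecord` and `read_fastq` are made up for this illustration, and the record in `example` is invented, not taken from SP1.fq.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str     # line 1, without the leading "@"
    sequence: str   # line 2, the nucleotides
    separator: str  # line 3, without the leading "+" (often empty)
    quality: str    # line 4, one character per nucleotide


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes unwrapped FASTQ)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"line 1 should start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"line 3 should start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("lines 2 and 4 must have the same length")
        yield FastqRecord(header[1:], sequence, separator[1:], quality)


# Invented record for illustration only (not from SP1.fq).
example = "@read1 length=8\nGATTACAG\n+\nIIIIHH#!\n"
for record in read_fastq(io.StringIO(example)):
    print(record.header, record.sequence, record.quality)
```

To try it on a real file, something like `with open("SP1.fq") as fh: first = next(read_fastq(fh))` should work, as long as the file follows the four-line layout.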

What's the deal with the weird looking quality scores?

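Each quality score is stored as a single character, so line 4 has exactly one character per nucleotide (instead of one- or two-digit numbers like 0 to 40 that would need separators). This reduces file size and keeps the quality string aligned with the sequence.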

 Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
                   |         |         |         |         |
    Quality score: 0........10........20........30........40
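The mapping in the diagram can be reproduced in a couple of lines of Python: take each character's ASCII code and subtract 33. This is a small sketch that assumes the Phred+33 encoding shown here; `decode_phred33` is a name invented for the example.

```python
from typing import List

PHRED33_OFFSET = 33  # ASCII code of "!", which stands for quality score 0


def decode_phred33(quality: str) -> List[int]:
    """Turn a quality string into a list of integer quality scores."""
    scores = [ord(ch) - PHRED33_OFFSET for ch in quality]
    if any(score < 0 for score in scores):
        raise ValueError("character below '!' is not valid in Phred+33")
    return scores


# The characters under the tick marks in the diagram.
print(decode_phred33("!+5?I"))  # [0, 10, 20, 30, 40]
```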

Side note to break the flow: This mapping corresponds to Phred+33, where the quality score is the character's ASCII code minus 33. There also exists a Phred+64 mapping, but apparently it is not used as frequently in new NGS data (src).

Uh, so each character stands for a number. What does the number mean?

The number is the quality score Q. It is tied to P, the probability that the called nucleotide is wrong, by the following formula.

Q = -10 × log10(P), where P is the probability that a base call is erroneous

Equivalently, P = 10^(-Q/10), so every 10 points of Q makes an error 10 times less likely.

Basically, this just means that:

| Phred quality score Q | Probability of incorrect base call P | Base call accuracy |
| --- | --- | --- |
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10,000 | 99.99% |

So, the higher the quality score Q behind a quality character, the more confident we can be that the nucleotide at that position in the read was called correctly.
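The relationship between Q and P is easy to check numerically. The sketch below converts in both directions using the formula Q = -10 × log10(P); the function names are invented for this example.

```python
import math


def error_probability(q: float) -> float:
    """Probability P that a base call with quality score Q is wrong."""
    return 10 ** (-q / 10)


def phred_score(p: float) -> float:
    """Quality score Q for an error probability P (0 < P <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("P must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Print P and the base call accuracy (1 - P) for a few quality scores.
for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"Q={q}: P={p:g}, accuracy={1 - p:.2%}")
```

Running it should print the same values as the table for Q = 10, 20, 30 and 40.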

Sources (and for more details, see): Wikipedia on FASTQ format and this tutorial.

Step-by-step tutorial to running fastqc on the HPC

Super awesome useful link: http://hpc.sabanciuniv.edu/

Contains information on how to get started with HPC. Highly recommend going through it! The information here is a subset of that contained in this guide.

  1. Log in to HPC: For Mac & Linux, open a terminal and run `ssh username@10.39.60.250`, replacing username with your own. If the username is correct, you'll be asked to enter your password. Both the username and password should be in the SABANCI HPC Cluster User Information email sent to you by someone from Compecta. Windows users, see the guide.

  2. (Option 1) Clone repo with data from GitHub: On the HPC, clone the repository with the data using `git clone https://github.com/nabiafshan/fastqc-hpc-tutorial.git`

           OR

  2. (Option 2) Copy file from your computer to HPC: For Mac & Linux, open a terminal and run `scp SP1.fq username@10.39.60.250:~/workfolder` (replace username, then enter your password when asked). This copies the file SP1.fq from your current directory to your workfolder on the HPC. The later steps assume the data sits in ~/workfolder/fastqc-hpc-tutorial, so adjust the paths if you put the file somewhere else. Windows users, see the guide. This StackExchange answer has some useful information on scp.

  3. View available modules: Use `module avail`. You can also use `module avail fastqc` to check whether fastqc is available.

  4. Get example job script from /cta/share/: Change to that directory with `cd /cta/share/`, then copy the slurm_example.sh file to the folder with your data: `scp slurm_example.sh ~/workfolder/fastqc-hpc-tutorial/`

  5. Modify slurm_example.sh: Go to the directory with the data and open slurm_example.sh in an editor, e.g. `nano slurm_example.sh`. Scroll down to the #Module File comment and add this line to load the fastqc module: `module load fastqc-0.11.7-gcc-8.2.0-43xnlgy`. Then add this line to run fastqc on the SP1.fq example file: `fastqc SP1.fq`. Save and exit the editor.

  6. Submit your job: Use `sbatch slurm_example.sh`

  7. Copy results to your computer: From a terminal on your own machine (not on the HPC!), run `scp username@10.39.60.250:~/workfolder/fastqc-hpc-tutorial/SP1_fastqc.zip .` (change username and enter your password). The trailing dot tells scp to put the file in the directory you're currently in. Unzip it and view the html file.
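If you'd rather not unzip by hand, the last step can be sketched in Python with the standard zipfile module. This is only an illustration: `extract_html_reports` is a made-up helper, and it assumes SP1_fastqc.zip has already been copied into the directory you run it from.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_html_reports(zip_path: str, dest_dir: str = ".") -> List[Path]:
    """Extract the archive and return the paths of any .html files inside."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    return [dest / name for name in names if name.lower().endswith(".html")]


if __name__ == "__main__":
    zip_file = Path("SP1_fastqc.zip")  # assumed to be in the current directory
    if zip_file.exists():
        for report in extract_html_reports(str(zip_file)):
            print(report)  # open this file in a browser
    else:
        print(f"{zip_file} not found; copy it from the HPC first")
```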

Other recommended links

  • Official website for fastqc. Super useful to look at example good and bad results.
  • Fastqc manual explains what the different analysis modules mean (see Section 3).
