# Check the quality of FASTQ files

This is a hands-on lesson: we will check the quality of FASTQ files and then process them to obtain raw count matrices that can be uploaded to Trailmaker.
For this lesson, we will use small example FASTQ files located in the [shared Google Drive folder named "fastq_files"](https://drive.google.com/drive/folders/1MU_HS86zPuOoftUBSD_jtQWpB-awUQ-O?usp=drive_link).

To check the quality of our FASTQ files, we are going to use a tool called FastQC. The following code will download and install [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

In [4]:
## Install FastQC ##

## Install java runtime environment, which is required by FastQC
# `!` runs shell commands inside the notebook, but this install requires sudo,
# so run it in a separate terminal instead:
# sudo apt-get install -y openjdk-8-jre-headless perl

## download FastQC (v0.12.1)
!wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip
# sudo apt install unzip
!unzip fastqc_v0.12.1.zip

--2025-02-06 17:32:28--  https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip
Resolving www.bioinformatics.babraham.ac.uk (www.bioinformatics.babraham.ac.uk)... 149.155.133.4
Connecting to www.bioinformatics.babraham.ac.uk (www.bioinformatics.babraham.ac.uk)|149.155.133.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11709692 (11M) [application/zip]
Saving to: ‘fastqc_v0.12.1.zip.1’


2025-02-06 17:32:31 (3.65 MB/s) - ‘fastqc_v0.12.1.zip.1’ saved [11709692/11709692]

/bin/bash: line 1: unzip: command not found


In [None]:
# unzip must be installed first (in a separate terminal: sudo apt install unzip)
!unzip fastqc_v0.12.1.zip

In [None]:
!chmod +x FastQC/fastqc
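
If `unzip` is not available (as in the `command not found` output above) and you cannot install it, Python's standard `zipfile` module is one possible alternative. The following is a minimal sketch; the helper names `extract_zip` and `make_executable` are hypothetical and only for illustration. Note that `zipfile` does not restore executable permissions, so the permission step is still needed.

```python
import os
import stat
import zipfile


def extract_zip(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


def make_executable(path):
    """Add execute permission for user, group and others (similar to chmod +x)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Usage in the notebook, with the file names from the cells above:
# extract_zip("fastqc_v0.12.1.zip")
# make_executable("FastQC/fastqc")
```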

Let's create a folder that will contain the output from FastQC.

In [None]:
%mkdir fastqc_output

Let's run FastQC on all our FASTQ files.

In [None]:
!/content/gdrive/MyDrive/trailmaker_course/FastQC/fastqc -t 8 -f fastq -o ./fastqc_output/ /content/gdrive/MyDrive/trailmaker_course/FASTQ_processing/fastq_files/*.gz

When FastQC has finished running, we get an HTML report for each FASTQ file that we can examine to identify potential issues.

For details on the different modules of FastQC and the most common reasons for warnings and errors, check out [this link](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/).
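
Besides the HTML report, FastQC writes a zip archive per input file that contains a tab-separated `summary.txt` (status, module name, file name on each line). As a rough sketch, the statuses can be collected in Python to spot flagged modules quickly. The function names and the example text below are made up for illustration; they are not real results from the course files.

```python
def parse_fastqc_summary(text):
    """Map each FastQC module name to its status (PASS, WARN or FAIL)."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses


def flagged_modules(statuses):
    """Return the modules whose status is not PASS, in file order."""
    return [module for module, status in statuses.items() if status != "PASS"]


# Illustrative summary text (hypothetical file name and statuses).
example_summary = (
    "PASS\tBasic Statistics\tsample_R1.fastq.gz\n"
    "WARN\tPer base sequence content\tsample_R1.fastq.gz\n"
    "FAIL\tPer sequence GC content\tsample_R1.fastq.gz\n"
)

statuses = parse_fastqc_summary(example_summary)
print(flagged_modules(statuses))
```

To use it on real output, read `summary.txt` from the extracted archive in `fastqc_output` and pass its contents to `parse_fastqc_summary`.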

The small example FASTQ files used in this lesson were chosen so that processing runs quickly. Because they are artificial, the FastQC report will indicate poor quality. To illustrate what the output looks like for high-quality data, we provide an example FastQC report from real, good-quality FASTQ files in the course PDF material for this lesson.

# Process FASTQ files


To process FASTQ files we are going to use a tool called [kallisto|bustools](https://www.kallistobus.tools/). It pseudoaligns the reads to the human reference transcriptome and creates gene counts from the reads to generate a *cell x gene* matrix.

Install [`kb-python`](https://www.kallistobus.tools/kb_usage/kb_usage/#kallisto-and-bustools)

In [None]:
%pip install kb-python


## Build a reference
Reference files can be downloaded from [this link](https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest).
In our case, the data is from human, so we will download the human reference.

NOTE: If the species is not human, you should download the corresponding transcriptome reference, t2g, and GTF files from your preferred source.

In [None]:
%cd /content/gdrive/MyDrive/trailmaker_course
!mkdir ./human-ref

In [None]:
species = "human"

!curl -O https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
!tar -xvf refdata-gex-GRCh38-2020-A.tar.gz --directory /content/gdrive/MyDrive/trailmaker_course/human-ref
!rm refdata-gex-GRCh38-2020-A.tar.gz

Build the reference:

In [None]:
# specify directories/files
ref_dir = "/content/gdrive/MyDrive/trailmaker_course/human-ref/refdata-gex-GRCh38-2020-A/"
transcriptome = ref_dir + "transcriptome.idx"
t2g = ref_dir + "t2g.txt"
cdna = ref_dir + "cdna.fa"
genome = ref_dir + "fasta/genome.fa"
gtf = ref_dir + "genes/genes.gtf"

# run kb ref
!kb ref -i $transcriptome -g $t2g -f1 $cdna $genome $gtf
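
If `kb ref` fails, a quick first check is whether the genome FASTA and GTF files are where the cell above expects them. The sketch below assumes the same layout as the `genome` and `gtf` variables (`fasta/genome.fa` and `genes/genes.gtf` inside the reference folder); the function name is hypothetical.

```python
from pathlib import Path


def missing_reference_inputs(ref_dir):
    """Return the kb ref input files that are missing from ref_dir."""
    ref_dir = Path(ref_dir)
    required = [
        ref_dir / "fasta" / "genome.fa",  # genome sequence
        ref_dir / "genes" / "genes.gtf",  # gene annotation
    ]
    return [str(path) for path in required if not path.is_file()]


# Usage in the notebook:
# missing = missing_reference_inputs(ref_dir)
# if missing:
#     print("Missing reference files:", missing)
```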

## Generate a raw count matrix

To generate the count matrix for each sample, we run `kb count`. It needs the following arguments:
1. `-i`: the index file, generated by `kb ref`.
2. `-g`: the transcript-to-gene (t2g) mapping file, generated by `kb ref`.
3. `-x`: the technology used to generate the data.
4. `-o`: the output folder.
5. The FASTQ files. There must be an even number of files, with as many R1 files as R2 files. List all R1 and R2 files for the sample, separated by spaces (or use `\` to break the line), in the order R1, R2, R1, R2, and so on.

This must be done for each sample. In this case we have only one sample, with 2 FASTQ files.
The cell below reads the FASTQ files from the folder "trailmaker_course/FASTQ_processing/fastq_files/".
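
When a sample has several R1/R2 pairs, getting the R1, R2, R1, R2 order right by hand is error-prone. The following sketch orders the files automatically. It assumes file names contain `_R1` or `_R2` followed by `.` or `_` (as in `pbmc_3Mreads_S1_R1.fastq.gz`); the function name is hypothetical, and other naming schemes would need a different pattern.

```python
import re
from pathlib import Path

# Assumption: the read tag looks like "_R1." or "_R1_" in the file name.
R1_TAG = re.compile(r"_R1([._])")


def interleave_read_pairs(fastq_paths):
    """Return paths ordered R1, R2, R1, R2, ... and fail on unpaired files."""
    paths = {str(Path(p)) for p in fastq_paths}
    ordered = []
    for r1 in sorted(p for p in paths if R1_TAG.search(Path(p).name)):
        r1_path = Path(r1)
        # Derive the expected mate name by swapping the R1 tag for R2.
        r2_name = R1_TAG.sub(r"_R2\1", r1_path.name, count=1)
        r2 = str(r1_path.with_name(r2_name))
        if r2 not in paths:
            raise ValueError(f"No R2 mate found for {r1}")
        ordered.extend([r1, r2])
    if len(ordered) != len(paths):
        raise ValueError("Some files are not part of an R1/R2 pair")
    return ordered


files = ["pbmc_3Mreads_S1_R2.fastq.gz", "pbmc_3Mreads_S1_R1.fastq.gz"]
print(" ".join(interleave_read_pairs(files)))
```

The space-joined string can then be passed as the final arguments of `kb count`.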

In [None]:
# specify directories/files
ref_dir = "/content/gdrive/MyDrive/trailmaker_course/human-ref/refdata-gex-GRCh38-2020-A/"
transcriptome = ref_dir + "transcriptome.idx"
t2g = ref_dir + "t2g.txt"
cdna = ref_dir + "cdna.fa"
genome = ref_dir + "fasta/genome.fa"
gtf = ref_dir + "genes/genes.gtf"


# run kb count
!kb count -i $transcriptome -g $t2g -x "SPLIT-SEQ" -o /content/gdrive/MyDrive/trailmaker_course/pbmc_1k_kbcount_output \
/content/gdrive/MyDrive/trailmaker_course/FASTQ_processing/fastq_files/pbmc_3Mreads_S1_R1.fastq.gz \
/content/gdrive/MyDrive/trailmaker_course/FASTQ_processing/fastq_files/pbmc_3Mreads_S1_R2.fastq.gz

# For extra samples, repeat the kb count command with each sample's own output folder and FASTQ files.

## Convert to files compatible with Trailmaker

Lastly, we have to convert the kallisto|bustools output into count matrix files that can be uploaded to Trailmaker Insights (one folder per sample, each with barcodes/features/matrix files). We provide an R script that does this automatically: `convert_kbout_to_matrices.R`, located in the same folder as the notebook. Running the following cell should be enough; if there are issues, the simplest solution is to run the contents of the script interactively in RStudio.

In [None]:
%cd /content/gdrive/MyDrive/trailmaker_course
!Rscript /content/gdrive/MyDrive/trailmaker_course/FASTQ_processing/convert_kbout_to_matrices.R "human"

The files generated (you should find them in the folder "trailmaker_course/pbmc_1k") are ready to be uploaded to Trailmaker!
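
Before uploading, it can be worth confirming that each sample folder contains the three expected files. This is only a sketch: it assumes the file names start with `barcodes`, `features`, and `matrix` (the exact names and extensions written by the R script are not checked here), and the function name is hypothetical.

```python
from pathlib import Path

# Assumption: file names start with these prefixes, whatever the extension.
EXPECTED_PREFIXES = ("barcodes", "features", "matrix")


def missing_matrix_files(sample_dir):
    """Return the expected file prefixes that have no matching file in sample_dir."""
    names = [p.name for p in Path(sample_dir).iterdir() if p.is_file()]
    return [
        prefix
        for prefix in EXPECTED_PREFIXES
        if not any(name.startswith(prefix) for name in names)
    ]


# Usage in the notebook, once per sample folder:
# print(missing_matrix_files("/content/gdrive/MyDrive/trailmaker_course/pbmc_1k"))
```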