### Supplementary Notebook 1
---
### Standard preprocessing of single-cell data

#### Overview of steps performed:

1. **Download the data**


2. **Raw read alignment** (5 options, A-E)

    A. CellRanger
    
    B. STAR
    
    C. BWA
    
    D. Kallisto Bustools
    
    E. Salmon
    
    
3. **Feature matrix generation using htseq-count**

### Step 1. Download the raw data

#### Download data: 10x Genomics

We have a few options here. For 10x Genomics data, the easiest is *CellRanger*, which organizes its output files conveniently, reports QC metrics for assessing data quality, and writes the count matrices in a convenient file format. *CellRanger* uses *STAR* to align reads to the genome. This particular sample will be aligned to the human reference `GRCh38` (hg38), which can be downloaded <a href="https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz">here</a>:

`curl -O https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz`

`tar -zxvf refdata-gex-GRCh38-2020-A.tar.gz`

For this example, we will start by downloading the data. We're using data from 10x Genomics: <a href="https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_v3">1k peripheral blood mononuclear cells (PBMCs)</a>

`wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar`

`tar -xvf pbmc_1k_v3_fastqs.tar`

Now we have:

```
pbmc_1k_v3_fastqs/
    pbmc_1k_v3_S1_L001_I1_001.fastq.gz
    pbmc_1k_v3_S1_L002_I1_001.fastq.gz
    pbmc_1k_v3_S1_L001_R1_001.fastq.gz
    pbmc_1k_v3_S1_L002_R1_001.fastq.gz
    pbmc_1k_v3_S1_L001_R2_001.fastq.gz
    pbmc_1k_v3_S1_L002_R2_001.fastq.gz
```

These `fastq` files are now stored in the directory where the above commands were executed. We can then run the `cellranger count` command as described on the 10x Genomics <a href="https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count">website</a>.
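As a rough sketch of that step, the function below wraps a `cellranger count` call for the files downloaded above. The function name and the run ID used in the usage note are made up for illustration, and the sample name is taken from the FASTQ file prefix. Recent Cell Ranger releases may require additional arguments (for example `--create-bam`), so check the linked documentation for the installed version.

```shell
# Illustrative wrapper around `cellranger count` (assumes cellranger is on PATH).
#   $1 = run ID (output folder name), $2 = reference directory,
#   $3 = FASTQ directory,             $4 = sample name (FASTQ file prefix)
run_cellranger_count() {
    cellranger count \
        --id="$1" \
        --transcriptome="$2" \
        --fastqs="$3" \
        --sample="$4"
}
```

For the files above, this could be called as `run_cellranger_count pbmc_1k_v3_count refdata-gex-GRCh38-2020-A pbmc_1k_v3_fastqs pbmc_1k_v3`.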

#### Download data: SmartSeq

For an example of data generated with ***SmartSeq*** technology, we will download data from the <a href="https://www.ncbi.nlm.nih.gov/geo/">**GEO Database**</a>. We will use a dataset from the seminal paper ***Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma*** [(Patel et al., 2014)](https://doi.org/10.1126/science.1254257).

After obtaining the list of accessions from GEO, we will execute the following code to download the data:
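A minimal sketch of such a download, assuming the accessions are SRA run IDs saved one per line in a text file (the name `SRR_Acc_List.txt` is an assumption) and that the SRA Toolkit commands `prefetch` and `fasterq-dump` are installed. The function name `download_runs` and the output directory are illustrative and not part of the original notebook.

```shell
# Illustrative helper: download every SRA run listed in a text file.
# Assumes the SRA Toolkit (prefetch, fasterq-dump) is on PATH.
#   $1 = accession list, one run ID per line (hypothetical name: SRR_Acc_List.txt)
#   $2 = output directory for the FASTQ files
download_runs() {
    acc_list=$1; fastq_dir=$2
    mkdir -p "$fastq_dir"
    # Read the list on file descriptor 3 so the tools cannot consume it via stdin.
    while IFS= read -r acc <&3 || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue          # skip blank lines
        prefetch "$acc"                    # fetch the run archive
        fasterq-dump --split-files --outdir "$fastq_dir" "$acc"   # convert to FASTQ
    done 3< "$acc_list"
}
```

A call such as `download_runs SRR_Acc_List.txt fastq` would then leave one or two FASTQ files per run in `fastq/`, depending on whether the run is single-end or paired-end.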

### Step 2. Raw read alignment

For the sake of instruction, we will not use CellRanger in this notebook. Instead, we will align the raw reads from the ***Patel et al. (2014) dataset***, stored in the `fastq` files we just downloaded, to the human genome (hg19), which can be downloaded from the <a href="https://hgdownload.soe.ucsc.edu/downloads.html">UCSC Genome Browser</a>.

Generate the STAR-compatible reference genome files (only needs to be done once per reference genome)
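One way this step might look, as a sketch only: the function below calls STAR in `genomeGenerate` mode. The function name, argument order, and any file names passed to it are assumptions made for illustration. STAR's `--sjdbOverhang` option is left at its default here and may be worth setting to the read length minus one.

```shell
# Illustrative helper: build a STAR index (assumes STAR is on PATH).
#   $1 = genome FASTA (e.g. hg19.fa), $2 = gene annotation GTF,
#   $3 = output index directory,      $4 = number of threads
star_build_index() {
    mkdir -p "$3"
    STAR --runMode genomeGenerate \
         --runThreadN "$4" \
         --genomeDir "$3" \
         --genomeFastaFiles "$1" \
         --sjdbGTFfile "$2"
}
```

For example, `star_build_index hg19.fa genes.gtf star_index 8`, where all three paths are placeholders.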

Perform the alignment
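A hedged sketch of a per-cell alignment call follows. The function name and argument layout are hypothetical. It assumes uncompressed FASTQ input; for gzipped files, STAR's `--readFilesCommand zcat` option would need to be added.

```shell
# Illustrative helper: align one cell's reads with STAR (assumes STAR is on PATH).
#   $1 = STAR index directory, $2 = output prefix (e.g. aligned/cell1_),
#   $3 = number of threads,    remaining args = FASTQ file(s): one for
#   single-end reads, two (mate 1 then mate 2) for paired-end reads
star_align() {
    genome_dir=$1; out_prefix=$2; threads=$3
    shift 3
    mkdir -p "$(dirname "$out_prefix")"
    STAR --runThreadN "$threads" \
         --genomeDir "$genome_dir" \
         --readFilesIn "$@" \
         --outFileNamePrefix "$out_prefix" \
         --outSAMtype BAM SortedByCoordinate
    # STAR writes the sorted alignments to ${out_prefix}Aligned.sortedByCoord.out.bam
}
```

Requesting a coordinate-sorted BAM here means the output can be indexed directly in the counting step.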

Other tools not described in detail:

<a href="https://salmon.readthedocs.io/en/latest/index.html">**Salmon**</a> and <a href="https://www.kallistobus.tools/">**kallisto | bustools**</a> can generate feature matrices much faster than the alignment-based tools used above, because they rely on a pseudoalignment approach. Additionally, <a href="http://bio-bwa.sourceforge.net/">**BWA**</a> can be used for low-divergence sequence alignment. The linked tutorials describe how to produce feature matrices for downstream use with the analysis tools mentioned below.

### Step 3. Read counting and feature matrix generation

The following bash script was run using <a href="http://www.htslib.org/">samtools index</a> and <a href="https://htseq.readthedocs.io/en/master/count.html">htseq-count</a> to index and count the reads per feature in each cell.
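The sketch below is not the original script; it shows one way such a loop could be written, assuming one coordinate-sorted BAM per cell in a single directory and an unstranded library (`-s no`). The function name, directory layout, and the `.counts.txt` suffix are hypothetical.

```shell
# Illustrative helper: index each BAM and count reads per feature with htseq-count
# (assumes samtools and htseq-count are on PATH).
#   $1 = annotation GTF, $2 = directory of coordinate-sorted BAMs (one per cell),
#   $3 = output directory for the per-cell count files
count_with_htseq() {
    gtf=$1; bam_dir=$2; counts_dir=$3
    mkdir -p "$counts_dir"
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue               # no BAM files matched the glob
        cell=$(basename "$bam" .bam)
        samtools index "$bam"                   # writes "$bam".bai
        # -r pos: BAM is coordinate-sorted; -s no: assumes an unstranded library
        htseq-count -f bam -r pos -s no "$bam" "$gtf" > "$counts_dir/$cell.counts.txt"
    done
}
```

Each resulting file holds one `feature<TAB>count` line per feature, followed by htseq-count's summary rows whose names start with `__`.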

We can also use <a href="http://www.htslib.org/">**samtools index**</a> and <a href="http://bioinf.wehi.edu.au/featureCounts/">**featureCounts**</a>. In general, this second option, **featureCounts**, is much faster.
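As an illustration of the featureCounts route, the hypothetical helper below passes all BAM files to a single call, which produces one table with a count column per BAM. The function name and argument order are assumptions; paired-end libraries need extra options (see `-p` in the featureCounts documentation).

```shell
# Illustrative helper: count reads for many cells in one featureCounts call
# (assumes featureCounts from the Subread package is on PATH).
#   $1 = annotation GTF, $2 = output table, $3 = threads, remaining args = BAM files
count_with_featurecounts() {
    gtf=$1; out_table=$2; threads=$3
    shift 3
    featureCounts -T "$threads" -a "$gtf" -o "$out_table" "$@"
}
```

For example, `count_with_featurecounts genes.gtf counts.txt 8 aligned/*.bam`, with placeholder paths.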

Both of these methods generate a text file in which each line contains a feature name and a feature count (plus other information, if using featureCounts). We can cut the count column from each file and paste the columns together alongside the genomic coordinates to get a feature matrix, cell x feature.

The output looks something like this (if you only had three cells!):
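As a toy illustration of the cut-and-paste step, the sketch below merges three made-up htseq-count style files. The gene names, cell names, and counts are invented for the example and are not taken from the dataset, and `build_matrix` is a hypothetical helper. It assumes every file lists the same features in the same order. The result has one row per feature and one column per cell, so it would be transposed to obtain the cell x feature orientation.

```shell
# Illustrative helper: merge per-cell htseq-count files into one tab-separated matrix.
#   $1 = output file, remaining args = per-cell count files (feature<TAB>count).
# Assumes every file lists the same features in the same order.
build_matrix() {
    matrix_out=$1; shift
    work=$(mktemp -d)
    tab=$(printf '\t')
    header="feature"
    # Feature names come from the first file; htseq-count's "__" summary rows are dropped.
    grep -v '^__' "$1" | cut -f1 > "$work/matrix"
    for f in "$@"; do
        header="$header$tab$(basename "$f" .counts.txt)"
        grep -v '^__' "$f" | cut -f2 | paste "$work/matrix" - > "$work/next"
        mv "$work/next" "$work/matrix"
    done
    { printf '%s\n' "$header"; cat "$work/matrix"; } > "$matrix_out"
    rm -r "$work"
}

# Toy input: three hypothetical cells with invented counts.
demo=$(mktemp -d)
printf 'GENE_A\t5\nGENE_B\t0\nGENE_C\t12\n__no_feature\t3\n' > "$demo/cell1.counts.txt"
printf 'GENE_A\t2\nGENE_B\t7\nGENE_C\t0\n__no_feature\t1\n'  > "$demo/cell2.counts.txt"
printf 'GENE_A\t0\nGENE_B\t1\nGENE_C\t4\n__no_feature\t2\n'  > "$demo/cell3.counts.txt"
build_matrix "$demo/matrix.tsv" "$demo"/cell*.counts.txt
cat "$demo/matrix.tsv"
# Prints (columns are tab-separated):
# feature  cell1  cell2  cell3
# GENE_A   5      2      0
# GENE_B   0      7      1
# GENE_C   12     0      4
```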

### Next steps
Now we are ready to move on to one of the supported analysis pipelines: *Scanpy*, *Seurat*, *PAGA*, and *STREAM*.