# Obtain RNA-seq test data
The test data consists of two commercially available RNA samples: Universal Human Reference (UHR) and Human Brain Reference (HBR). We have three complete experimental replicates for each sample. In addition, a spike-in control was used: we added an aliquot of the ERCC ExFold RNA Spike-In Control Mixes to each sample. The spike-in consists of 92 transcripts that are present in known concentrations across a wide abundance range.

To summarize, we have:
1. UHR + ERCC Spike-In Mix1, Replicate 1
2. UHR + ERCC Spike-In Mix1, Replicate 2
3. UHR + ERCC Spike-In Mix1, Replicate 3
4. HBR + ERCC Spike-In Mix2, Replicate 1
5. HBR + ERCC Spike-In Mix2, Replicate 2
6. HBR + ERCC Spike-In Mix2, Replicate 3

Each data set has a corresponding pair of FASTQ files (read 1 and read 2 of the paired-end reads), for 12 files in total.
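Once the files are on disk, the read 1 / read 2 pairing can be checked with a short loop. The sketch below is only an illustration: it builds a hypothetical demo directory of empty placeholder files that follow the tutorial's naming pattern, and the variable names (`data_dir`, `n_pairs`, `n_missing`) are assumptions, not part of the course material.

```bash
# Hypothetical demo directory with empty placeholder files; with the real
# data you would set data_dir="$RNA_DATA_DIR" after unpacking the archive.
data_dir=$(mktemp -d)
for sample in UHR_Rep1_ERCC-Mix1 HBR_Rep1_ERCC-Mix2; do
    touch "$data_dir/${sample}_Build37-ErccTranscripts-chr22.read1.fastq.gz"
    touch "$data_dir/${sample}_Build37-ErccTranscripts-chr22.read2.fastq.gz"
done

n_pairs=0
n_missing=0
for r1 in "$data_dir"/*.read1.fastq.gz; do
    # Derive the expected mate file name by swapping the read1 suffix for read2
    r2="${r1%.read1.fastq.gz}.read2.fastq.gz"
    if [ -e "$r2" ]; then
        n_pairs=$((n_pairs + 1))
    else
        n_missing=$((n_missing + 1))
        echo "missing mate for: $(basename "$r1")"
    fi
done
echo "complete pairs: $n_pairs, missing mates: $n_missing"
```

On the full tutorial data set this kind of check would be expected to report six complete pairs.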

In [1]:
echo $RNA_DATA_DIR # RNA_DATA_DIR=/home/ubuntu/workspace/rnaseq/data
mkdir -p $RNA_DATA_DIR
cd $RNA_DATA_DIR
# Obtain data from course directory:
wget http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar 

/home/ubuntu/workspace/rnaseq/data
--2025-04-27 00:04:53--  http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar
Resolving genomedata.org (genomedata.org)... 54.71.55.4
Connecting to genomedata.org (genomedata.org)|54.71.55.4|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar [following]
--2025-04-27 00:04:53--  https://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar
Connecting to genomedata.org (genomedata.org)|54.71.55.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 116602880 (111M) [application/x-tar]
Saving to: ‘HBR_UHR_ERCC_ds_5pc.tar’


2025-04-27 00:04:57 (38.1 MB/s) - ‘HBR_UHR_ERCC_ds_5pc.tar’ saved [116602880/116602880]
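Before unpacking, the contents of a tar archive can be listed without extracting anything by using `tar -tf`. The following is a minimal sketch on a small, hypothetical demo archive (`demo.tar` and the placeholder file names are made up for illustration); with the real download you would run `tar -tf HBR_UHR_ERCC_ds_5pc.tar` instead.

```bash
# Build a tiny stand-in archive in a temporary directory
work=$(mktemp -d)
printf 'placeholder\n' > "$work/sample.read1.fastq"
printf 'placeholder\n' > "$work/sample.read2.fastq"
tar -cf "$work/demo.tar" -C "$work" sample.read1.fastq sample.read2.fastq

# -t lists the archive members, -f names the archive; nothing is extracted
archive_listing=$(tar -tf "$work/demo.tar")
n_entries=$(printf '%s\n' "$archive_listing" | wc -l | tr -d ' ')

echo "$archive_listing"
echo "entries in archive: $n_entries"
```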



In [2]:
tar -xvf HBR_UHR_ERCC_ds_5pc.tar # unpack the tar archive into 12 separate fastq.gz files (-v prints each name in plain text)
ls # lists the 12 fastq.gz files (shown in red in the terminal) plus HBR_UHR_ERCC_ds_5pc.tar

HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz


The reads are paired-end 101-mers generated on an Illumina HiSeq instrument. The test data has been pre-filtered for reads that appear to map to chromosome 22. 
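Read length can be checked directly from a FASTQ file by tabulating the length of every sequence line (line 2 of each 4-line record). The sketch below runs on a small synthetic file so that it is self-contained; the file name `demo.fastq.gz` and its reads are invented for illustration. `gzip -cd` is used here and is equivalent to `zcat`.

```bash
# Create a tiny synthetic gzipped FASTQ file with three reads
demo=$(mktemp -d)/demo.fastq.gz
printf '@r1/1\nACGTACGT\n+\nIIIIIIII\n@r2/1\nACGTAC\n+\nIIIIII\n@r3/1\nTTTTACGT\n+\nIIIIIIII\n' \
    | gzip -c > "$demo"

# NR % 4 == 2 selects the sequence line of each record;
# the array n counts how many reads have each length
length_table=$(gzip -cd "$demo" \
    | awk 'NR % 4 == 2 { n[length($0)]++ } END { for (len in n) print len, n[len] }' \
    | sort -n)

# Each output row is "<read length> <number of reads>"
echo "$length_table"
```

Replacing `"$demo"` with one of the tutorial's fastq.gz files would give the length distribution for that library.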

In [3]:
# View the first two read records of a fastq.gz file; each record spans 4 lines, so print 8 lines in total
zcat UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz | head -n 8

@HWI-ST718_146963544:6:1213:8996:10047/1
CTTTTTTATTTTTGTCTGACTGGGTTGATTCAAAGGTCTGGTCTTTGAGCTCTTAAATTAGTTCTTCTATTTGGCCTAGTCTGTTGCTAAGGCTGCCAAC
+
CCCFFFFFHHHHGJHIIJHIHIIIFHIJJJJIJJGIBBFGEGGHIIHGGIJJIIHGGHIIIFGCGHHIIHIHHEEE?DFEFFFEEDCEEDDDDDDDBCDD
@HWI-ST718_146963544:5:2303:11793:37095/1
ATGAATTATAGGGCTGTATTTTAATTTTGCATTTTAAATTCCTGCAGTTTTCTTCCATCACTTTTCACCATGCATTGTATACTTGGAATTGCTTTTTGTG
+
@@??BDDFFF<FHEGFFGGIEBGHIIIIIBEHIIGIH<FHEFHHCHABF@DFHGGGII<DHBFGGGGBEGGIBHG@DHGIIIH@DE>CCHF:;>@BC>@@


### How many reads are there in the first library? 
Decompress the file on the fly with ‘zcat’, pipe the output into ‘grep’ to keep only the lines that start with the read name prefix, and pipe those into ‘wc’ to count them (‘-l’ counts lines instead of words).

In [4]:
zcat UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz | grep -P "^\@HWI" | wc -l
# grep "^@HWI" would also work; -P enables Perl-compatible regular expressions, which this simple pattern does not need

227392
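Matching on the read name prefix works here, but it depends on knowing the prefix. Matching on `^@` alone is not safe, because a quality string may also begin with `@` (the second record shown earlier has a quality line starting with `@@`). An alternative that needs no pattern at all is to divide the total line count by 4. The sketch below demonstrates the difference on a synthetic two-read file; the file and variable names are illustrative assumptions.

```bash
# Two reads; the quality line of the second read starts with '@'
demo=$(mktemp -d)/demo.fastq.gz
printf '@HWI-demo:1/1\nACGT\n+\nIIII\n@HWI-demo:2/1\nACGT\n+\n@@??\n' \
    | gzip -c > "$demo"

# Naive count: every line beginning with '@' (header lines AND one quality line)
n_at_lines=$(gzip -cd "$demo" | grep -c '^@')

# Pattern-free count: each FASTQ record is exactly 4 lines
n_lines=$(gzip -cd "$demo" | wc -l)
n_reads=$((n_lines / 4))

echo "lines starting with @: $n_at_lines"
echo "reads (lines / 4): $n_reads"
```

The naive count reports 3 while the file holds 2 reads, which is why the line-count approach is often preferred when the files are known to use four lines per record.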


# PRACTICAL EXERCISE 3
Assignment: Download an additional dataset and unpack it.

In [5]:
cd $RNA_HOME
mkdir -p practice/data
cd $RNA_HOME/practice/data
wget http://genomedata.org/rnaseq-tutorial/practical.tar
tar -xvf practical.tar

--2025-04-27 00:25:05--  http://genomedata.org/rnaseq-tutorial/practical.tar
Resolving genomedata.org (genomedata.org)... 54.71.55.4
Connecting to genomedata.org (genomedata.org)|54.71.55.4|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://genomedata.org/rnaseq-tutorial/practical.tar [following]
--2025-04-27 00:25:05--  https://genomedata.org/rnaseq-tutorial/practical.tar
Connecting to genomedata.org (genomedata.org)|54.71.55.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 363653120 (347M) [application/x-tar]
Saving to: ‘practical.tar’


2025-04-27 00:25:14 (44.5 MB/s) - ‘practical.tar’ saved [363653120/363653120]

hcc1395_normal_rep1_r1.fastq.gz
hcc1395_normal_rep1_r2.fastq.gz
hcc1395_normal_rep2_r1.fastq.gz
hcc1395_normal_rep2_r2.fastq.gz
hcc1395_normal_rep3_r1.fastq.gz
hcc1395_normal_rep3_r2.fastq.gz
hcc1395_tumor_rep1_r1.fastq.gz
hcc1395_tumor_rep1_r2.fastq.gz
hcc1395_tumor_rep2_r1.fastq.gz
hcc1395_tumor_re

### For the first read in the hcc1395 normal, replicate 1, read 1 file, what was the physical location of the read on the flow cell (i.e. lane, tile, x, y)?

In [6]:
zcat hcc1395_normal_rep1_r1.fastq.gz | head -n 1

@K00193:38:H3MYFBBXX:4:1101:10003:44458/1


Lane = 4, tile = 1101, x = 10003, y = 44458.
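The same fields can be pulled out of the header programmatically by splitting on `:`. This is a sketch that hard-codes the header shown above. The names given to the first three fields (`instrument`, `run`, `flowcell`) are assumptions based on the usual Illumina header layout; only lane, tile, x, and y are confirmed by the answer above.

```bash
header='@K00193:38:H3MYFBBXX:4:1101:10003:44458/1'

# Strip the leading '@' and the trailing '/1' mate suffix
core=${header#@}
core=${core%/*}

# Split the remaining text on ':' into seven named fields
IFS=: read -r instrument run flowcell lane tile x y <<EOF
$core
EOF

echo "lane=$lane tile=$tile x=$x y=$y"
```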

### In the first read of this same file, how many ‘T’ bases are there?

In [7]:
zcat hcc1395_normal_rep1_r1.fastq.gz | head -n 2 | tail -n 1 | grep -o T | wc -l
# grep -o (--only-matching) prints each match, e.g. "T", on its own line, so wc -l counts the matches

32
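The same idea extends to counting every base in the first read at once: `grep -o .` emits one character per line, and `sort | uniq -c` tallies them. The sketch below uses a made-up one-read file (`demo.fastq.gz` and its sequence are illustrative only) so that it can be run anywhere.

```bash
# Synthetic single-read FASTQ file; the sequence is ACGTTTNA
demo=$(mktemp -d)/demo.fastq.gz
printf '@r1/1\nACGTTTNA\n+\nIIIIIIII\n' | gzip -c > "$demo"

# Take the sequence line of the first record, split it into single
# characters, tally each character, and print the tallies as BASE=COUNT
base_counts=$(gzip -cd "$demo" \
    | head -n 2 | tail -n 1 \
    | grep -o . \
    | sort | uniq -c \
    | awk '{ print $2 "=" $1 }' \
    | paste -sd' ' -)

echo "$base_counts"
```

Pointing the pipeline at `hcc1395_normal_rep1_r1.fastq.gz` instead of `"$demo"` would report the count for all bases in that file's first read, including the ‘T’ count found above.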
