This notebook downloads all data. It assumes the working directory is the root cibog directory; the first cell moves up one level if the notebook was started from the code folder.

In [2]:
import os
from IPython.display import Image


if os.getcwd().endswith("code"):
    os.chdir("..")


In [18]:
! pwd

/mnt/w/game_cibog
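
The cells below write into tools, reads/bulk, references and bulk, and the shell commands fail if one of these folders is missing. A minimal sketch for creating them up front follows; the helper name ensure_dirs and the constant EXPECTED_DIRS are illustrative and not part of the original notebook.

```python
from pathlib import Path

# Directories that later cells write into, relative to the project root.
EXPECTED_DIRS = ["tools", "reads/bulk", "references", "bulk"]


def ensure_dirs(root="."):
    """Create any missing working directories and return the ones created."""
    created = []
    for rel in EXPECTED_DIRS:
        path = Path(root) / rel
        if not path.is_dir():
            path.mkdir(parents=True)
            created.append(rel)
    return created

# Typical use from the project root:
# ensure_dirs()
```

The function is safe to rerun: directories that already exist are left alone and are not reported again.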


# Install sratoolkit to download SRA-hosted sequencing data

The raw reads are hosted on SRA, so we first install sratoolkit, which provides the tools for downloading SRA-hosted files.

In [None]:
! wget --output-document tools/sratoolkit.tar.gz http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz

! cd tools && tar -vxzf sratoolkit.tar.gz
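
Later cells call vdb-config and fastq-dump by bare name, so the extracted binaries need to be on PATH. The sketch below prepends the toolkit's bin folder to the PATH of the running kernel. It assumes the archive unpacks into a folder matching tools/sratoolkit.*/ with a bin subfolder; the function name add_sratoolkit_to_path is illustrative.

```python
import glob
import os


def add_sratoolkit_to_path(tools_dir="tools"):
    """Prepend the extracted sratoolkit bin directory to PATH.

    Assumes the archive unpacked into <tools_dir>/sratoolkit.*/bin.
    Returns the directory that was added, or None if none was found.
    """
    pattern = os.path.join(tools_dir, "sratoolkit.*", "bin")
    candidates = sorted(c for c in glob.glob(pattern) if os.path.isdir(c))
    if not candidates:
        return None
    # If several versions are unpacked, take the last one in sorted order.
    bin_dir = os.path.abspath(candidates[-1])
    parts = os.environ.get("PATH", "").split(os.pathsep)
    if bin_dir not in parts:
        os.environ["PATH"] = os.pathsep.join([bin_dir] + parts)
    return bin_dir
```

Calling add_sratoolkit_to_path() once before the vdb-config cell should be enough, since shell commands started from the notebook inherit the kernel's environment.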

Briefly run vdb-config and exit without making any changes:

In [None]:
! vdb-config --interactive

# Downloading and preparing data

## Bulk Data

## Downloading and processing bulk data from reads

To process the bulk data from reads, we need several software tools and reference files.

Make sure to install hisat2 and samtools (e.g. sudo apt-get install hisat2 samtools on Ubuntu).

We also need featureCounts, which is part of the Subread package: https://sourceforge.net/projects/subread/files/subread-2.0.3/ . Make sure to install a copy of it as well; the cells below download and build it from source.

In [10]:
! cd tools && wget -O subread-2.0.3-source.tar.gz https://jztkft.dl.sourceforge.net/project/subread/subread-2.0.3/subread-2.0.3-source.tar.gz

--2022-01-19 13:41:23--  https://jztkft.dl.sourceforge.net/project/subread/subread-2.0.3/subread-2.0.3-source.tar.gz
Resolving jztkft.dl.sourceforge.net (jztkft.dl.sourceforge.net)... 45.67.159.245
Connecting to jztkft.dl.sourceforge.net (jztkft.dl.sourceforge.net)|45.67.159.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23304665 (22M) [application/x-gzip]
Saving to: ‘subread-2.0.3-source.tar.gz’


2022-01-19 13:41:29 (3.81 MB/s) - ‘subread-2.0.3-source.tar.gz’ saved [23304665/23304665]



Extract the subread-archive in the tools folder:

In [11]:
! cd tools && tar xfz subread-2.0.3-source.tar.gz

And build the subread package:

In [12]:
! cd tools/subread-2.0.3-source/src && make -f Makefile.Linux -j 4

gcc  -mtune=core2  -O3 -DMAKE_FOR_EXON  -D MAKE_STANDALONE -D SUBREAD_VERSION=\""2.0.3"\"  -D_FILE_OFFSET_BITS=64    -fmessage-length=0  -ggdb     -c -o core.o core.c
gcc  -mtune=core2  -O3 -DMAKE_FOR_EXON  -D MAKE_STANDALONE -D SUBREAD_VERSION=\""2.0.3"\"  -D_FILE_OFFSET_BITS=64    -fmessage-length=0  -ggdb     -c -o core-junction.o core-junction.c
gcc  -mtune=core2  -O3 -DMAKE_FOR_EXON  -D MAKE_STANDALONE -D SUBREAD_VERSION=\""2.0.3"\"  -D_FILE_OFFSET_BITS=64    -fmessage-length=0  -ggdb     -c -o core-indel.o core-indel.c
gcc  -mtune=core2  -O3 -DMAKE_FOR_EXON  -D MAKE_STANDALONE -D SUBREAD_VERSION=\""2.0.3"\"  -D_FILE_OFFSET_BITS=64    -fmessage-length=0  -ggdb     -c -o sambam-file.o sambam-file.c
gcc  -mtune=core2  -O3 -DMAKE_FOR_EXON  -D MAKE_STANDALONE -D SUBREAD_VERSION=\""2.0.3"\"  -D_FILE_OFFSET_BITS=64    -fmessage-length=0  -ggdb     -c -o sublog.o sublog.c
gcc  -mtune=core2  -O3 -DMAKE_FOR_EXON  -D MAKE_STANDALONE -D SUBREAD_VERSION=\""2.0.3"\"  -D_FILE_OFFSET_BITS=64    

We can now check that featureCounts was built and is executable:

In [15]:
! tools/subread-2.0.3-source/bin/featureCounts -v


featureCounts v2.0.3



Now that everything is installed, let's download and rename the read files! Afterwards we will download all required references (genome for alignment + genome annotation for counting).

We first navigate from the GEO Series https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131102 to a single sample (GSM) we want to download. There we follow the relation to SRA (at the bottom):

<img src="../images/gsm_record.png" />

On the SRX record page (SRA Experiment) you can find the SRR (SRA Run) IDs of all relevant files. Here, we expect unpaired reads and just one file per sample. You can now download the SRR-run data with fastq-dump from the command line:

<img src="../images/srx_record.png"/>

In [4]:
#24h R1
! cd reads/bulk/ && fastq-dump --gzip SRR9048093
#24h R2
! cd reads/bulk/ && fastq-dump --gzip SRR9048094
#JQ1 24h R1
! cd reads/bulk/ && fastq-dump --gzip SRR9048103
#JQ1 24h R2
! cd reads/bulk/ && fastq-dump --gzip SRR9048104


Read 17872256 spots for SRR9048093
Written 17872256 spots for SRR9048093
Read 40740514 spots for SRR9048094
Written 40740514 spots for SRR9048094
Read 43554379 spots for SRR9048103
Written 43554379 spots for SRR9048103
Read 28277603 spots for SRR9048104
Written 28277603 spots for SRR9048104


For easier reference we rename the SRR files to more interpretable names:

In [6]:
! mv reads/bulk/SRR9048093.fastq.gz reads/bulk/SUM159_24h_R1.fastq.gz
! mv reads/bulk/SRR9048094.fastq.gz reads/bulk/SUM159_24h_R2.fastq.gz
! mv reads/bulk/SRR9048103.fastq.gz reads/bulk/SUM159R_24h_R1.fastq.gz
! mv reads/bulk/SRR9048104.fastq.gz reads/bulk/SUM159R_24h_R2.fastq.gz
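
The download and rename steps can also be driven from a single mapping, which keeps run accessions and sample names in one place. The following is a sketch under that assumption: the accessions and names are copied from the cells above, the helper names download_plan and run_plan are illustrative, and the --outdir option of fastq-dump is used instead of cd.

```python
import os
import subprocess

# SRR run accession -> sample name, copied from the cells above.
RUNS = {
    "SRR9048093": "SUM159_24h_R1",
    "SRR9048094": "SUM159_24h_R2",
    "SRR9048103": "SUM159R_24h_R1",
    "SRR9048104": "SUM159R_24h_R2",
}


def download_plan(runs, out_dir="reads/bulk"):
    """Return (command, downloaded_path, renamed_path) for each run."""
    plan = []
    for srr, sample in runs.items():
        cmd = ["fastq-dump", "--gzip", "--outdir", out_dir, srr]
        src = os.path.join(out_dir, srr + ".fastq.gz")
        dst = os.path.join(out_dir, sample + ".fastq.gz")
        plan.append((cmd, src, dst))
    return plan


def run_plan(plan):
    """Download each run with fastq-dump, then rename its output file."""
    for cmd, src, dst in plan:
        subprocess.run(cmd, check=True)
        os.replace(src, dst)
```

Printing download_plan(RUNS) first is a cheap way to check the mapping before starting downloads that take a long time; run_plan(download_plan(RUNS)) then performs them.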

Hisat2 will be used to align the reads to the reference genome. You should already have installed Hisat2 (otherwise: sudo apt-get install hisat2). We chose the GRCh38 reference genome and download its prebuilt Hisat2 index. For instructions on how to run Hisat2, or for indexes of other reference genomes, see: http://daehwankimlab.github.io/hisat2/

In [5]:
! cd references && wget -O hsapien_hisat2.tar.gz https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz

--2022-01-19 12:36:56--  https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz
Resolving genome-idx.s3.amazonaws.com (genome-idx.s3.amazonaws.com)... 52.217.18.212
Connecting to genome-idx.s3.amazonaws.com (genome-idx.s3.amazonaws.com)|52.217.18.212|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4210306865 (3.9G) [binary/octet-stream]
Saving to: ‘hsapien_hisat2.tar.gz’


2022-01-19 12:53:16 (4.10 MB/s) - ‘hsapien_hisat2.tar.gz’ saved [4210306865/4210306865]



We also extract this tar.gz file:

In [7]:
! cd references && tar xfz hsapien_hisat2.tar.gz
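
With the index extracted, the alignment of one sample could look roughly like the sketch below, which only builds the command strings. Several things here are assumptions and not taken from the notebook: the index prefix references/grch38/genome (check what the archive actually unpacked to), the output folder aligned, the thread count, and the function name alignment_commands. The reads are passed with -U because they are unpaired.

```python
import shlex


def alignment_commands(sample, index="references/grch38/genome",
                       reads_dir="reads/bulk", out_dir="aligned", threads=4):
    """Return shell commands that align one sample and sort the result."""
    fastq = f"{reads_dir}/{sample}.fastq.gz"
    sam = f"{out_dir}/{sample}.sam"
    bam = f"{out_dir}/{sample}.sorted.bam"
    # hisat2: -x index prefix, -U unpaired reads, -S SAM output, -p threads
    align = ["hisat2", "-p", str(threads), "-x", index, "-U", fastq, "-S", sam]
    # samtools sort: -o output BAM, -@ additional threads
    sort = ["samtools", "sort", "-@", str(threads), "-o", bam, sam]
    return [shlex.join(align), shlex.join(sort)]


for command in alignment_commands("SUM159_24h_R1"):
    print(command)
```

Building the strings first and printing them makes it easy to inspect the paths before anything is executed; the output folder has to exist before the commands are run.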

We also download the genome annotation (GTF) from Ensembl. Make sure that it matches the reference genome in terms of chromosome identifiers (chr1 vs 1):

In [21]:
! cd references && wget -O ensembl.105.annotation.gtf.gz http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz

--2022-01-19 14:24:48--  http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.197.76
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.197.76|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50829040 (48M) [application/x-gzip]
Saving to: ‘ensembl.105.annotation.gtf.gz’


2022-01-19 14:25:14 (1.90 MB/s) - ‘ensembl.105.annotation.gtf.gz’ saved [50829040/50829040]



In [22]:
! cd references && gunzip ensembl.105.annotation.gtf.gz
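
One way to check the chromosome identifiers is to collect the sequence names from the first column of the GTF and see whether they carry a chr prefix. The sketch below does that for the annotation side; the function names gtf_seqnames and uses_chr_prefix are illustrative. The names on the index side can then be compared by hand, for example from the output of hisat2-inspect with its names option, assuming that tool is installed alongside hisat2.

```python
import gzip


def gtf_seqnames(path, limit=None):
    """Collect the distinct sequence names (column 1) of a GTF file.

    limit optionally restricts the scan to the first `limit` lines.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    names = set()
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if limit is not None and i >= limit:
                break
            # Skip header comments and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def uses_chr_prefix(names):
    """True if all names start with 'chr', False if none do, else None."""
    flags = {name.startswith("chr") for name in names}
    if len(flags) != 1:
        return None  # mixed styles or no names at all
    return flags.pop()
```

For the file downloaded above, uses_chr_prefix(gtf_seqnames("references/ensembl.105.annotation.gtf")) reports the naming style of the annotation. Scanning the full file takes a while; it is still worth doing once, because a limited scan only sees the first sequences.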

## Download preprocessed gene expression files from GEO

Instead of downloading the reads and running alignment and counting yourself, you can also rely on the processed outputs of the original authors. These are available from each GEO sample (GSM record).

First the 24h samples:

<img src="../images/bulk_sample_gse.png" />

In [3]:
! cd bulk && wget -O GSM3763467_SUM159_DMSO_24h_R1.counts.txt.gz "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3763467&format=file&file=GSM3763467%5FSUM159%5FDMSO%5F24h%5FR1%2Ecounts%2Etxt%2Egz"
! cd bulk && wget -O GSM3763468_SUM159_DMSO_24h_R2.counts.txt.gz "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3763468&format=file&file=GSM3763468%5FSUM159%5FDMSO%5F24h%5FR2%2Ecounts%2Etxt%2Egz"

! cd bulk && wget -O GSM3763477_SUM159_JQ1R_DMSO_24h_R1.counts.txt.gz "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3763477&format=file&file=GSM3763477%5FSUM159%5FJQ1R%5FDMSO%5F24h%5FR1%2Ecounts%2Etxt%2Egz"
! cd bulk && wget -O GSM3763478_SUM159_JQ1R_DMSO_24h_R2.counts.txt.gz "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3763478&format=file&file=GSM3763478%5FSUM159%5FJQ1R%5FDMSO%5F24h%5FR2%2Ecounts%2Etxt%2Egz"

--2022-01-18 14:08:01--  https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3763467&format=file&file=GSM3763467%5FSUM159%5FDMSO%5F24h%5FR1%2Ecounts%2Etxt%2Egz
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 130.14.29.110, 2607:f220:41e:4290::110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96895 (95K) [application/octet-stream]
Saving to: ‘GSM3763467_SUM159_DMSO_24h_R1.counts.txt.gz’


2022-01-18 14:08:02 (420 KB/s) - ‘GSM3763467_SUM159_DMSO_24h_R1.counts.txt.gz’ saved [96895/96895]

--2022-01-18 14:08:02--  https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3763468&format=file&file=GSM3763468%5FSUM159%5FDMSO%5F24h%5FR2%2Ecounts%2Etxt%2Egz
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 130.14.29.110, 2607:f220:41e:4290::110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10

Then the 3h samples:

In [4]:
! cd bulk && wget -O GSM3763469_SUM159_DMSO_3h_R1.counts.txt.gz "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3763469&format=file&file=GSM3763469%5FSUM159%5FDMSO%5F3h%5FR1%2Ecounts%2Etxt%2Egz"
! cd bulk && wget -O GSM3763470_SUM159_DMSO_3h_R2.counts.txt.gz "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3763470&format=file&file=GSM3763470%5FSUM159%5FDMSO%5F3h%5FR2%2Ecounts%2Etxt%2Egz"

! cd bulk && wget -O GSM3763479_SUM159_JQ1R_DMSO_3h_R1.counts.txt.gz "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3763479&format=file&file=GSM3763479%5FSUM159%5FJQ1R%5FDMSO%5F3h%5FR1%2Ecounts%2Etxt%2Egz"
! cd bulk && wget -O GSM3763480_SUM159_JQ1R_DMSO_3h_R2.counts.txt.gz "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3763480&format=file&file=GSM3763480%5FSUM159%5FJQ1R%5FDMSO%5F3h%5FR2%2Ecounts%2Etxt%2Egz"

--2022-01-18 14:08:19--  https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3763469&format=file&file=GSM3763469%5FSUM159%5FDMSO%5F3h%5FR1%2Ecounts%2Etxt%2Egz
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 130.14.29.110, 2607:f220:41e:4290::110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101284 (99K) [application/octet-stream]
Saving to: ‘GSM3763469_SUM159_DMSO_3h_R1.counts.txt.gz’


2022-01-18 14:08:20 (312 KB/s) - ‘GSM3763469_SUM159_DMSO_3h_R1.counts.txt.gz’ saved [101284/101284]

--2022-01-18 14:08:20--  https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM3763470&format=file&file=GSM3763470%5FSUM159%5FDMSO%5F3h%5FR2%2Ecounts%2Etxt%2Egz
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 130.14.29.110, 2607:f220:41e:4290::110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 977

In [16]:
! ls bulk

24h_sum159_vs_sum159_jq1r.tsv
3h_sum159_vs_sum159_jq1r.tsv
GSM3763467_SUM159_DMSO_24h_R1.counts.txt.gz
GSM3763468_SUM159_DMSO_24h_R2.counts.txt.gz
GSM3763469_SUM159_DMSO_3h_R1.counts.txt.gz
GSM3763470_SUM159_DMSO_3h_R2.counts.txt.gz
GSM3763477_SUM159_JQ1R_DMSO_24h_R1.counts.txt.gz
GSM3763478_SUM159_JQ1R_DMSO_24h_R2.counts.txt.gz
GSM3763479_SUM159_JQ1R_DMSO_3h_R1.counts.txt.gz
GSM3763480_SUM159_JQ1R_DMSO_3h_R2.counts.txt.gz
normed_expr.24h_sum159_vs_sum159_jq1r.tsv
normed_expr.3h_sum159_vs_sum159_jq1r.tsv
sig.24h_sum159_vs_sum159_jq1r.tsv
sig.3h_sum159_vs_sum159_jq1r.tsv
sig.down.24h_sum159_vs_sum159_jq1r.tsv
sig.down.3h_sum159_vs_sum159_jq1r.tsv
sig.sum159_vs_sum159_jq1r.tsv
sig.up.24h_sum159_vs_sum159_jq1r.tsv
sig.up.3h_sum159_vs_sum159_jq1r.tsv
top.all.24h_sum159_vs_sum159_jq1r.tsv
top.all.3h_sum159_vs_sum159_jq1r.tsv
top.down.24h_sum159_vs_sum159_jq1r.tsv
top.down.3h_sum159_vs_sum159_jq1r.tsv
top.up.24h_sum159_vs_sum159_jq1r.tsv
top.up.3h_sum159_vs_sum159_jq1r.tsv
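
The GEO download URLs used above all follow the same pattern: the GSM accession goes into the acc parameter and the file name, with underscores and dots percent-encoded, goes into the file parameter. A small sketch that rebuilds these URLs from a file name is given below; the function names geo_file_url and gsm_from_filename are illustrative, and the pattern is inferred from the URLs in this notebook only.

```python
def gsm_from_filename(filename):
    """Return the GSM accession that prefixes a GEO supplementary file name."""
    return filename.split("_", 1)[0]


def geo_file_url(filename):
    """Build the GEO download URL for one supplementary file of a GSM record.

    Only '_' and '.' are encoded, matching the URLs used in this notebook;
    file names containing other special characters are not handled.
    """
    # urllib.parse.quote leaves '_' and '.' untouched, so the two
    # characters are replaced by hand to reproduce the original URLs.
    encoded = filename.replace("_", "%5F").replace(".", "%2E")
    gsm = gsm_from_filename(filename)
    return ("https://www.ncbi.nlm.nih.gov/geo/download/"
            f"?acc={gsm}&format=file&file={encoded}")


print(geo_file_url("GSM3763467_SUM159_DMSO_24h_R1.counts.txt.gz"))
```

Looping over the eight count file names with this helper would replace the hand-written wget lines, at the cost of hiding the literal URLs.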


## Download scRNA-seq data

How to download scRNA-seq data is explained in the next notebook!