# Prepare Developing Forebrain Bulk-seq

## Data Source
- All data were downloaded from the ENCODE project [here](https://www.encodeproject.org/search/?type=Experiment&status=released&lab.title=Barbara+Wold%2C+Caltech&biosample_ontology.term_name=forebrain)
- This dataset contains polyA+ RNA-seq of the developing forebrain at 8 time points
    - E10.5
    - E11.5
    - E12.5
    - E13.5
    - E14.5
    - E15.5
    - E16.5
    - P0
- Each time point has two replicates, for a total of 16 RNA-seq samples
- Some experiments are paired-end sequencing, some are single-end
- We start from a total of 25 fastq files from ENCODE; all experiments are single-end sequencing
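
The design listed above can be written down explicitly, which is handy later when checking that no sample is missing. This is only a minimal sketch; the names `STAGES`, `REPLICATES` and `expected_samples` are made up for illustration and are not used elsewhere in this notebook.

```python
from itertools import product

# Developmental stages in chronological order, as listed above
STAGES = ['E10.5', 'E11.5', 'E12.5', 'E13.5', 'E14.5', 'E15.5', 'E16.5', 'P0']
# Two biological replicates per stage
REPLICATES = [1, 2]

# One (stage, replicate) pair per expected RNA-seq sample
expected_samples = list(product(STAGES, REPLICATES))
print(len(expected_samples))  # 8 stages x 2 replicates = 16
```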


*Note: the fastq files and the salmon index are not included in the GitHub repository. I only provide the code I used to process them; you don't need to rerun it on a laptop.*

## Download FASTQ

In [5]:
# These are the URLs of the files to download
!cat ./fastq/files.txt
# This command downloads all files (~30 GB); you don't need to repeat it
# !xargs -L 1 curl -O -L < files.txt

https://www.encodeproject.org/metadata/?type=Experiment&status=released&files.file_type=fastq&lab.title=Barbara+Wold%2C+Caltech&biosample_ontology.term_name=forebrain
https://www.encodeproject.org/files/ENCFF270GKY/@@download/ENCFF270GKY.fastq.gz
https://www.encodeproject.org/files/ENCFF460TCF/@@download/ENCFF460TCF.fastq.gz
https://www.encodeproject.org/files/ENCFF126IRS/@@download/ENCFF126IRS.fastq.gz
https://www.encodeproject.org/files/ENCFF748SRJ/@@download/ENCFF748SRJ.fastq.gz
https://www.encodeproject.org/files/ENCFF931IVO/@@download/ENCFF931IVO.fastq.gz
https://www.encodeproject.org/files/ENCFF114DRT/@@download/ENCFF114DRT.fastq.gz
https://www.encodeproject.org/files/ENCFF959PSX/@@download/ENCFF959PSX.fastq.gz
https://www.encodeproject.org/files/ENCFF235DNM/@@download/ENCFF235DNM.fastq.gz
https://www.encodeproject.org/files/ENCFF329ACL/@@download/ENCFF329ACL.fastq.gz
https://www.encodeproject.org/files/ENCFF251LNG/@@download/ENCFF251LNG.fastq.gz
https://www.encodeproj
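
For readers who prefer to stay in Python, the same download could be sketched with the standard library instead of `xargs` and `curl`. This is not what was run for this notebook; `plan_downloads` and `download_all` are hypothetical helper names, and the sketch assumes every fastq URL ends in `<accession>.fastq.gz`, as in the list above.

```python
import pathlib
import urllib.request
from urllib.parse import urlparse


def plan_downloads(urls):
    """Pair each fastq URL with the local file name it should be saved under."""
    plan = []
    for url in urls:
        url = url.strip()
        name = pathlib.PurePosixPath(urlparse(url).path).name
        # Skip blank lines and the metadata query URL, which has no fastq file name
        if not name.endswith('.fastq.gz'):
            continue
        plan.append((url, name))
    return plan


def download_all(url_file, out_dir):
    """Download every fastq listed in url_file into out_dir, skipping existing files."""
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(url_file) as handle:
        for url, name in plan_downloads(handle):
            target = out_dir / name
            if not target.exists():
                urllib.request.urlretrieve(url, target)


# Dry run on two example URLs; nothing is downloaded here
example = [
    'https://www.encodeproject.org/metadata/?type=Experiment&status=released',
    'https://www.encodeproject.org/files/ENCFF270GKY/@@download/ENCFF270GKY.fastq.gz',
]
print(plan_downloads(example))
```

Note that skipping existing files also skips partially downloaded ones, so an interrupted file would have to be deleted by hand before calling `download_all` again.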

In [7]:
# rename the raw metadata file (it is tab-separated, so give it the .tsv extension that is read below)
!mv ?type\=Experiment\&status\=released\&files.file_type\=fastq\&lab.title\=Barbara+Wold%2C+Caltech\&biosample_ontology.term_name\=forebrain raw_metadata.tsv

# check out what has been downloaded
!ls -hl ./fastq

total 62G
drwxr-xr-x 2 hanliu users 4.0K Apr 17 13:08 data
-rw-r--r-- 1 hanliu users 1.8G Apr 17 11:28 ENCFF037JQC.fastq.gz
-rw-r--r-- 1 hanliu users 3.1G Apr 17 11:12 ENCFF114DRT.fastq.gz
-rw-r--r-- 1 hanliu users 1.7G Apr 17 11:08 ENCFF126IRS.fastq.gz
-rw-r--r-- 1 hanliu users 4.3G Apr 17 11:39 ENCFF179JEC.fastq.gz
-rw-r--r-- 1 hanliu users 3.4G Apr 17 11:22 ENCFF203BWA.fastq.gz
-rw-r--r-- 1 hanliu users 4.3G Apr 17 11:17 ENCFF235DNM.fastq.gz
-rw-r--r-- 1 hanliu users 1.9G Apr 17 11:19 ENCFF251LNG.fastq.gz
-rw-r--r-- 1 hanliu users 2.0G Apr 17 11:05 ENCFF270GKY.fastq.gz
-rw-r--r-- 1 hanliu users 1.7G Apr 17 11:25 ENCFF294JRP.fastq.gz
-rw-r--r-- 1 hanliu users 2.1G Apr 17 11:33 ENCFF320FJX.fastq.gz
-rw-r--r-- 1 hanliu users 2.6G Apr 17 11:18 ENCFF329ACL.fastq.gz
-rw-r--r-- 1 hanliu users 1.8G Apr 17 11:30 ENCFF358MFI.fastq.gz
-rw-r--r-- 1 hanliu users 1.8G Apr 17 11:27 ENCFF447EXU.fastq.gz
-rw-r--r-- 1 hanliu users 1.8G Apr 17 11:29 ENCFF458NWF.fastq.gz
-rw-r--r-- 1 ha

## Rename file and make metadata

In [8]:
# pandas handles tables; it is the "Excel" of Python
import pandas as pd
# pathlib handles everything related to file paths
import pathlib

In [10]:
metadata = pd.read_csv('./fastq/raw_metadata.tsv', sep='\t', index_col=0)
print('The dataframe shape', metadata.shape)
metadata.head()

The dataframe shape (25, 54)


File accession,File format,File type,File format type,Output type,Experiment accession,Assay,Biosample term id,Biosample term name,Biosample type,Biosample organism,...,Assembly,Genome annotation,Platform,Controlled by,File Status,s3_uri,Audit WARNING,Audit INTERNAL_ACTION,Audit NOT_COMPLIANT,Audit ERROR
ENCFF329ACL,fastq,fastq,,reads,ENCSR160IIN,RNA-seq,UBERON:0001890,forebrain,tissue,Mus musculus,...,,,Illumina HiSeq 2500,,released,s3://encode-public/2014/10/30/08ac2b2f-122f-4c...,,,,
ENCFF251LNG,fastq,fastq,,reads,ENCSR160IIN,RNA-seq,UBERON:0001890,forebrain,tissue,Mus musculus,...,,,Illumina HiSeq 2500,,released,s3://encode-public/2014/10/30/b2b42120-3ca3-4c...,,,,
ENCFF896COV,fastq,fastq,,reads,ENCSR160IIN,RNA-seq,UBERON:0001890,forebrain,tissue,Mus musculus,...,,,Illumina HiSeq 2500,,released,s3://encode-public/2014/10/30/a0df1ba6-c4a3-4b...,,,,
ENCFF959PSX,fastq,fastq,,reads,ENCSR970EWM,RNA-seq,UBERON:0001890,forebrain,tissue,Mus musculus,...,,,Illumina HiSeq 2500,,released,s3://encode-public/2015/09/23/a57550bb-5941-4f...,,,,
ENCFF235DNM,fastq,fastq,,reads,ENCSR970EWM,RNA-seq,UBERON:0001890,forebrain,tissue,Mus musculus,...,,,Illumina HiSeq 2500,,released,s3://encode-public/2015/09/23/d4d9c094-7e3b-43...,,,,


In [11]:
metadata.columns

Index(['File format', 'File type', 'File format type', 'Output type',
       'Experiment accession', 'Assay', 'Biosample term id',
       'Biosample term name', 'Biosample type', 'Biosample organism',
       'Biosample treatments', 'Biosample treatments amount',
       'Biosample treatments duration',
       'Biosample genetic modifications methods',
       'Biosample genetic modifications categories',
       'Biosample genetic modifications targets',
       'Biosample genetic modifications gene targets',
       'Biosample genetic modifications site coordinates',
       'Biosample genetic modifications zygosity', 'Experiment target',
       'Library made from', 'Library depleted in', 'Library extraction method',
       'Library lysis method', 'Library crosslinking method',
       'Library strand specific', 'Experiment date released', 'Project',
       'RBNS protein concentration', 'Library fragmentation method',
       'Library size range', 'Biological replicate(s)', 'Technical replica

### Clean metadata, select necessary columns

In [12]:
use_columns = [
    'Output type', 'Experiment accession', 'Biosample term id',
    'Biosample term name', 'Biological replicate(s)'
]

metadata_selected = metadata[use_columns].copy()
print('The dataframe shape', metadata_selected.shape)
metadata_selected.head()

The dataframe shape (25, 5)


File accession,Output type,Experiment accession,Biosample term id,Biosample term name,Biological replicate(s)
ENCFF329ACL,reads,ENCSR160IIN,UBERON:0001890,forebrain,1
ENCFF251LNG,reads,ENCSR160IIN,UBERON:0001890,forebrain,2
ENCFF896COV,reads,ENCSR160IIN,UBERON:0001890,forebrain,2
ENCFF959PSX,reads,ENCSR970EWM,UBERON:0001890,forebrain,2
ENCFF235DNM,reads,ENCSR970EWM,UBERON:0001890,forebrain,1


### Concatenate timepoint information

In [13]:
# I manually created this sample table, which contains the developmental time information
exp_meta = pd.read_csv('metadata/ENCODE_experiments.tsv', sep='\t', index_col=-1)
print('The dataframe shape', exp_meta.shape)
exp_meta.head()

The dataframe shape (8, 4)


ENCODE accession,Stage,Tissue_short,Tissue_full,Data type
ENCSR304RDL,E10.5,FB,forebrain,RNA-seq
ENCSR160IIN,E11.5,FB,forebrain,RNA-seq
ENCSR647QBV,E12.5,FB,forebrain,RNA-seq
ENCSR970EWM,E13.5,FB,forebrain,RNA-seq
ENCSR185LWM,E14.5,FB,forebrain,RNA-seq


In [14]:
# These experiment accessions ("ENCSR...") match the index of exp_meta;
# our goal is to add the Stage column from exp_meta to metadata_selected
metadata_selected['Experiment accession'].head()

File accession
ENCFF329ACL    ENCSR160IIN
ENCFF251LNG    ENCSR160IIN
ENCFF896COV    ENCSR160IIN
ENCFF959PSX    ENCSR970EWM
ENCFF235DNM    ENCSR970EWM
Name: Experiment accession, dtype: object

In [15]:
# a single line of code can do this: map() looks up each accession in the index of exp_meta['Stage']
metadata_selected['Stage'] = metadata_selected['Experiment accession'].map(exp_meta['Stage'])

print('The dataframe shape', metadata_selected.shape)
metadata_selected.head()

The dataframe shape (25, 6)


File accession,Output type,Experiment accession,Biosample term id,Biosample term name,Biological replicate(s),Stage
ENCFF329ACL,reads,ENCSR160IIN,UBERON:0001890,forebrain,1,E11.5
ENCFF251LNG,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5
ENCFF896COV,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5
ENCFF959PSX,reads,ENCSR970EWM,UBERON:0001890,forebrain,2,E13.5
ENCFF235DNM,reads,ENCSR970EWM,UBERON:0001890,forebrain,1,E13.5
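
One thing worth knowing about `Series.map` with a Series as the lookup: values that have no match in the lookup index silently become `NaN`. The toy example below, with made-up row labels and one made-up accession (`ENCSR_HYPOTHETICAL`), sketches how such unmatched rows could be listed before moving on.

```python
import pandas as pd

# Toy lookup table: experiment accession -> stage (values copied from the tables above)
stage_lookup = pd.Series({'ENCSR160IIN': 'E11.5', 'ENCSR970EWM': 'E13.5'})

# Toy file table; 'ENCSR_HYPOTHETICAL' is a made-up accession with no match
files = pd.DataFrame(
    {'Experiment accession': ['ENCSR160IIN', 'ENCSR970EWM', 'ENCSR_HYPOTHETICAL']},
    index=['file_a', 'file_b', 'file_c'],
)

# map() looks up each value in the index of stage_lookup
files['Stage'] = files['Experiment accession'].map(stage_lookup)

# Accessions without a match become NaN, so list them explicitly
unmatched = files.loc[files['Stage'].isna(), 'Experiment accession'].tolist()
print(unmatched)  # ['ENCSR_HYPOTHETICAL']
```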


### Sanity-check the metadata
- Example checks to validate your metadata

In [16]:
print('Number of tissues', 
      metadata_selected['Biosample term name'].unique().size)
metadata_selected['Biosample term name'].unique()

Number of tissues 1


array(['forebrain'], dtype=object)

In [17]:
print('Number of time points in each tissue')
metadata_selected.groupby('Biosample term name')['Stage'].apply(lambda i: i.unique().size)

Number of time points in each tissue


Biosample term name
forebrain    8
Name: Stage, dtype: int64

In [18]:
print('Does each experiment (tissue*stage) have 2 replicates?')

# In plain words: the number of distinct replicates in every tissue and stage combination equals 2
sum(
    metadata_selected\
    .groupby(['Biosample term name', 'Stage'])\
    .apply(lambda i: i['Biological replicate(s)'].unique().size) != 2
) == 0

Does each experiment (tissue*stage) have 2 replicates?


True
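
The three checks above could also be bundled into one function that fails loudly when the table does not match the expected design. This is a sketch, not part of the original workflow: `check_design` is a hypothetical helper, the column names are the ones used before renaming, and the toy table only mimics the E11.5 rows shown earlier.

```python
import pandas as pd


def check_design(table, n_stages=8, n_replicates=2):
    """Raise AssertionError if the table does not match the expected design."""
    tissues = table['Biosample term name'].unique()
    assert len(tissues) == 1, f'expected one tissue, found {list(tissues)}'
    assert table['Stage'].notna().all(), 'some files have no stage'
    assert table['Stage'].nunique() == n_stages, 'unexpected number of stages'
    reps = table.groupby('Stage')['Biological replicate(s)'].nunique()
    bad = reps[reps != n_replicates]
    assert bad.empty, f'stages without {n_replicates} replicates: {list(bad.index)}'
    # Number of fastq files per (stage, replicate); a value above 1 means
    # that the sample is spread over several files
    return table.groupby(['Stage', 'Biological replicate(s)']).size()


# Toy table with one stage, two replicates and three files
toy = pd.DataFrame({
    'Biosample term name': ['forebrain'] * 3,
    'Stage': ['E11.5'] * 3,
    'Biological replicate(s)': [1, 2, 2],
})
files_per_sample = check_design(toy, n_stages=1)
print(files_per_sample)
```

The returned counts make it easy to see which samples have more than one fastq file, which is why 16 samples can correspond to 25 files.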

### Rename the files, add sample information

In [None]:
# create an output dir (-p: no error if it already exists)
!mkdir -p fastq/data

In [22]:
# rename the columns
rename_dict = {
    'Output type': 'count_type',
    'Experiment accession': 'experiment_id',
    'Biosample term id': 'bio_sample_id',
    'Biosample term name': 'tissue',
    'Biological replicate(s)': 'replicate',
    'Stage': 'dev_time'
}
metadata_selected.rename(columns=rename_dict, inplace=True)

In [23]:
# subprocess creates a sub-process to execute a shell command
import subprocess

In [25]:
output_path_records = []
for file_id, row in metadata_selected.iterrows():
    tissue, dev_time, rep = row[['tissue', 'dev_time', 'replicate']]
    
    # assemble the input file path and make it absolute
    input_path = pathlib.Path(f'./fastq/{file_id}.fastq.gz').absolute()
    
    # assemble the output file path
    output_path = f'./fastq/data/{tissue}_{dev_time}_{rep}_{file_id}.fastq.gz'.replace(' ', '')
    
    # save the output file path
    output_path_records.append(pathlib.Path(output_path).name)
    # and create the soft link
    subprocess.run(['ln', '-s', input_path, output_path])
    
metadata_selected['file_name'] = output_path_records
metadata_selected.head()

File accession,count_type,experiment_id,bio_sample_id,tissue,replicate,dev_time,file_name
ENCFF329ACL,reads,ENCSR160IIN,UBERON:0001890,forebrain,1,E11.5,forebrain_E11.5_1_ENCFF329ACL.fastq.gz
ENCFF251LNG,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5,forebrain_E11.5_2_ENCFF251LNG.fastq.gz
ENCFF896COV,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5,forebrain_E11.5_2_ENCFF896COV.fastq.gz
ENCFF959PSX,reads,ENCSR970EWM,UBERON:0001890,forebrain,2,E13.5,forebrain_E13.5_2_ENCFF959PSX.fastq.gz
ENCFF235DNM,reads,ENCSR970EWM,UBERON:0001890,forebrain,1,E13.5,forebrain_E13.5_1_ENCFF235DNM.fastq.gz
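
The same soft links could be created without a shell call by using `pathlib.Path.symlink_to`. The sketch below is an alternative to the `ln -s` loop above, not the code that was run; `link_with_sample_name` is a hypothetical helper, and the demonstration uses a throwaway directory with an empty placeholder file (on Windows, creating symlinks may require extra privileges).

```python
import pathlib
import tempfile


def link_with_sample_name(src, out_dir, tissue, dev_time, rep):
    """Create out_dir/<tissue>_<dev_time>_<rep>_<original name> as a symlink to src."""
    src = pathlib.Path(src).resolve()
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    link = out_dir / f'{tissue}_{dev_time}_{rep}_{src.name}'.replace(' ', '')
    # Re-running the cell should not fail on links that already exist
    if not link.is_symlink():
        link.symlink_to(src)
    return link


# Demonstration in a temporary directory with an empty placeholder file
with tempfile.TemporaryDirectory() as tmp:
    raw = pathlib.Path(tmp) / 'ENCFF329ACL.fastq.gz'
    raw.touch()
    link = link_with_sample_name(raw, pathlib.Path(tmp) / 'data', 'forebrain', 'E11.5', 1)
    link_name = link.name
    link_ok = link.resolve() == raw.resolve()
    print(link_name, link_ok)
```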


## Save the completed metadata table
In each step, I generate a metadata table that records the relevant information about all files and samples. This table guides the next step.

In [28]:
metadata_selected.to_csv('metadata/fastq_metadata.csv')
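
Because the next step starts from this file, it can help to confirm that the table survives a round trip through CSV. The sketch below uses a two-row toy table and an in-memory buffer instead of `metadata/fastq_metadata.csv`; the point is that `index_col=0` is needed to get the file accession back as the index.

```python
import io

import pandas as pd

# Toy table with the same index name as metadata_selected
table = pd.DataFrame(
    {'dev_time': ['E11.5', 'P0'], 'replicate': [1, 2]},
    index=pd.Index(['ENCFF329ACL', 'ENCFF447EXU'], name='File accession'),
)

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
table.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the file accession as the index
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded.index.name)
print(reloaded.loc['ENCFF447EXU', 'dev_time'])
```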

In [31]:
!ls -hl ./fastq/data/

total 0
lrwxrwxrwx 1 hanliu users 65 Apr 17 13:08 forebrain_E10.5_1_ENCFF320FJX.fastq.gz -> /home/hanliu/project/genome_book/DevFB/fastq/ENCFF320FJX.fastq.gz
lrwxrwxrwx 1 hanliu users 65 Apr 17 13:08 forebrain_E10.5_1_ENCFF920CNZ.fastq.gz -> /home/hanliu/project/genome_book/DevFB/fastq/ENCFF920CNZ.fastq.gz
lrwxrwxrwx 1 hanliu users 65 Apr 17 13:08 forebrain_E10.5_2_ENCFF528EVC.fastq.gz -> /home/hanliu/project/genome_book/DevFB/fastq/ENCFF528EVC.fastq.gz
lrwxrwxrwx 1 hanliu users 65 Apr 17 13:08 forebrain_E10.5_2_ENCFF663SNC.fastq.gz -> /home/hanliu/project/genome_book/DevFB/fastq/ENCFF663SNC.fastq.gz
lrwxrwxrwx 1 hanliu users 65 Apr 17 13:08 forebrain_E11.5_1_ENCFF329ACL.fastq.gz -> /home/hanliu/project/genome_book/DevFB/fastq/ENCFF329ACL.fastq.gz
lrwxrwxrwx 1 hanliu users 65 Apr 17 13:08 forebrain_E11.5_2_ENCFF251LNG.fastq.gz -> /home/hanliu/project/genome_book/DevFB/fastq/ENCFF251LNG.fastq.gz
lrwxrwxrwx 1 hanliu users 65 Apr 17 13:08 forebrain_E11.5_2_ENCFF896COV.fastq.gz -> /

In [32]:
metadata_selected

File accession,count_type,experiment_id,bio_sample_id,tissue,replicate,dev_time,file_name
ENCFF329ACL,reads,ENCSR160IIN,UBERON:0001890,forebrain,1,E11.5,forebrain_E11.5_1_ENCFF329ACL.fastq.gz
ENCFF251LNG,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5,forebrain_E11.5_2_ENCFF251LNG.fastq.gz
ENCFF896COV,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5,forebrain_E11.5_2_ENCFF896COV.fastq.gz
ENCFF959PSX,reads,ENCSR970EWM,UBERON:0001890,forebrain,2,E13.5,forebrain_E13.5_2_ENCFF959PSX.fastq.gz
ENCFF235DNM,reads,ENCSR970EWM,UBERON:0001890,forebrain,1,E13.5,forebrain_E13.5_1_ENCFF235DNM.fastq.gz
ENCFF270GKY,reads,ENCSR185LWM,UBERON:0001890,forebrain,1,E14.5,forebrain_E14.5_1_ENCFF270GKY.fastq.gz
ENCFF460TCF,reads,ENCSR185LWM,UBERON:0001890,forebrain,1,E14.5,forebrain_E14.5_1_ENCFF460TCF.fastq.gz
ENCFF126IRS,reads,ENCSR185LWM,UBERON:0001890,forebrain,2,E14.5,forebrain_E14.5_2_ENCFF126IRS.fastq.gz
ENCFF748SRJ,reads,ENCSR185LWM,UBERON:0001890,forebrain,2,E14.5,forebrain_E14.5_2_ENCFF748SRJ.fastq.gz
ENCFF447EXU,reads,ENCSR362AIZ,UBERON:0001890,forebrain,2,P0,forebrain_P0_2_ENCFF447EXU.fastq.gz
