# Prepare Developing Forebrain Bulk-seq (Demo)

# **Each FASTQ file contains only 10,000 reads from the source file downloaded below, so you can map it quickly in this demo**

## Aims of this notebook

1. Prepare a clean metadata table that records the sample information for each FASTQ file.
2. Create soft links to the FASTQ files. A soft link is a good way to give a FASTQ file a meaningful name while keeping the original file unchanged.

## Data Source
- All data were downloaded from the ENCODE project [here](https://www.encodeproject.org/search/?type=Experiment&status=released&lab.title=Barbara+Wold%2C+Caltech&biosample_ontology.term_name=forebrain)
- This dataset contains polyA+ RNA-seq of the developing forebrain at 8 time points
    - E10.5
    - E11.5
    - E12.5
    - E13.5
    - E14.5
    - E15.5
    - E16.5
    - P0
- Each time point has two replicates, for a total of 16 RNA-seq samples
- Some experiments are split into multiple FASTQ files, others are not
- We start from a total of 25 FASTQ files from ENCODE; all experiments are single-end sequencing
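
The last two points explain why there are 25 files for 16 samples. A minimal sketch of how to see which replicates are split across several FASTQ files, assuming a small hand-made table (the five rows are copied from the metadata preview shown later in this notebook; the variable names are only for illustration):

```python
import pandas as pd

# Five rows copied from the metadata preview; the real table has 25 rows
files = pd.DataFrame(
    {
        'Experiment accession': ['ENCSR160IIN'] * 3 + ['ENCSR970EWM'] * 2,
        'Biological replicate(s)': [1, 2, 2, 2, 1],
    },
    index=pd.Index(
        ['ENCFF329ACL', 'ENCFF251LNG', 'ENCFF896COV', 'ENCFF959PSX', 'ENCFF235DNM'],
        name='File accession',
    ),
)

# Count how many FASTQ files belong to each experiment/replicate pair
files_per_replicate = files.groupby(
    ['Experiment accession', 'Biological replicate(s)']
).size()

# Keep only the replicates that are split across more than one file
split_replicates = files_per_replicate[files_per_replicate > 1]
print(split_replicates)
```

In these five rows, only replicate 2 of ENCSR160IIN is stored in more than one file.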

## Rename file and make metadata

In [1]:
# pandas handles tables
import pandas as pd
# pathlib handles everything path-related
import pathlib

In [2]:
# Read the raw metadata table that was downloaded from ENCODE together with the FASTQ files
metadata = pd.read_csv('../../data/DevFB/fastq/raw_metadata.tsv', sep='\t', index_col=0)
print('The dataframe shape', metadata.shape)
metadata.head()

The dataframe shape (25, 54)


File accession,File format,File type,File format type,Output type,Experiment accession,Assay,Biosample term id,Biosample term name,Biosample type,Biosample organism,...,Assembly,Genome annotation,Platform,Controlled by,File Status,s3_uri,Audit WARNING,Audit INTERNAL_ACTION,Audit NOT_COMPLIANT,Audit ERROR
ENCFF329ACL,fastq,fastq,,reads,ENCSR160IIN,RNA-seq,UBERON:0001890,forebrain,tissue,Mus musculus,...,,,Illumina HiSeq 2500,,released,s3://encode-public/2014/10/30/08ac2b2f-122f-4c...,,,,
ENCFF251LNG,fastq,fastq,,reads,ENCSR160IIN,RNA-seq,UBERON:0001890,forebrain,tissue,Mus musculus,...,,,Illumina HiSeq 2500,,released,s3://encode-public/2014/10/30/b2b42120-3ca3-4c...,,,,
ENCFF896COV,fastq,fastq,,reads,ENCSR160IIN,RNA-seq,UBERON:0001890,forebrain,tissue,Mus musculus,...,,,Illumina HiSeq 2500,,released,s3://encode-public/2014/10/30/a0df1ba6-c4a3-4b...,,,,
ENCFF959PSX,fastq,fastq,,reads,ENCSR970EWM,RNA-seq,UBERON:0001890,forebrain,tissue,Mus musculus,...,,,Illumina HiSeq 2500,,released,s3://encode-public/2015/09/23/a57550bb-5941-4f...,,,,
ENCFF235DNM,fastq,fastq,,reads,ENCSR970EWM,RNA-seq,UBERON:0001890,forebrain,tissue,Mus musculus,...,,,Illumina HiSeq 2500,,released,s3://encode-public/2015/09/23/d4d9c094-7e3b-43...,,,,


In [3]:
metadata.columns

Index(['File format', 'File type', 'File format type', 'Output type',
       'Experiment accession', 'Assay', 'Biosample term id',
       'Biosample term name', 'Biosample type', 'Biosample organism',
       'Biosample treatments', 'Biosample treatments amount',
       'Biosample treatments duration',
       'Biosample genetic modifications methods',
       'Biosample genetic modifications categories',
       'Biosample genetic modifications targets',
       'Biosample genetic modifications gene targets',
       'Biosample genetic modifications site coordinates',
       'Biosample genetic modifications zygosity', 'Experiment target',
       'Library made from', 'Library depleted in', 'Library extraction method',
       'Library lysis method', 'Library crosslinking method',
       'Library strand specific', 'Experiment date released', 'Project',
       'RBNS protein concentration', 'Library fragmentation method',
       'Library size range', 'Biological replicate(s)', 'Technical replica

### Clean metadata, select necessary columns

In [4]:
# The raw metadata table contains many unnecessary columns, so here we select only the important ones
use_columns = [
    'Output type', 'Experiment accession', 'Biosample term id',
    'Biosample term name', 'Biological replicate(s)'
]

metadata_selected = metadata[use_columns].copy()
print('The dataframe shape', metadata_selected.shape)
metadata_selected.head()

The dataframe shape (25, 5)


File accession,Output type,Experiment accession,Biosample term id,Biosample term name,Biological replicate(s)
ENCFF329ACL,reads,ENCSR160IIN,UBERON:0001890,forebrain,1
ENCFF251LNG,reads,ENCSR160IIN,UBERON:0001890,forebrain,2
ENCFF896COV,reads,ENCSR160IIN,UBERON:0001890,forebrain,2
ENCFF959PSX,reads,ENCSR970EWM,UBERON:0001890,forebrain,2
ENCFF235DNM,reads,ENCSR970EWM,UBERON:0001890,forebrain,1


### Concatenate timepoint information

In [5]:
# I manually created this sample table so we can add the "Stage" (developmental time point) to the metadata table
exp_meta = pd.read_csv('../../data/DevFB/metadata/ENCODE_experiments.tsv', sep='\t', index_col=-1)
print('The dataframe shape', exp_meta.shape)
exp_meta.head()

The dataframe shape (8, 4)


ENCODE accession,Stage,Tissue_short,Tissue_full,Data type
ENCSR304RDL,E10.5,FB,forebrain,RNA-seq
ENCSR160IIN,E11.5,FB,forebrain,RNA-seq
ENCSR647QBV,E12.5,FB,forebrain,RNA-seq
ENCSR970EWM,E13.5,FB,forebrain,RNA-seq
ENCSR185LWM,E14.5,FB,forebrain,RNA-seq


In [6]:
# These experiment accessions ("ENCSR...") match the index of exp_meta.
# Our goal is to add the Stage column from exp_meta to metadata_selected using these IDs
metadata_selected['Experiment accession'].head()

File accession
ENCFF329ACL    ENCSR160IIN
ENCFF251LNG    ENCSR160IIN
ENCFF896COV    ENCSR160IIN
ENCFF959PSX    ENCSR970EWM
ENCFF235DNM    ENCSR970EWM
Name: Experiment accession, dtype: object

In [7]:
# A single line of code can do this with the pandas map function;
# see the documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
metadata_selected['Stage'] = metadata_selected['Experiment accession'].map(exp_meta['Stage'])

print('The dataframe shape', metadata_selected.shape)
metadata_selected.head()

The dataframe shape (25, 6)


File accession,Output type,Experiment accession,Biosample term id,Biosample term name,Biological replicate(s),Stage
ENCFF329ACL,reads,ENCSR160IIN,UBERON:0001890,forebrain,1,E11.5
ENCFF251LNG,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5
ENCFF896COV,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5
ENCFF959PSX,reads,ENCSR970EWM,UBERON:0001890,forebrain,2,E13.5
ENCFF235DNM,reads,ENCSR970EWM,UBERON:0001890,forebrain,1,E13.5
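
One thing to keep in mind with `map`: an ID that is missing from the lookup table does not raise an error, it silently becomes `NaN`. A minimal sketch of how to spot that, assuming made-up file and experiment IDs (`FILE_A`, `EXP_1`, and so on are placeholders, not real ENCODE accessions):

```python
import pandas as pd

# Toy stand-ins for the two tables (all IDs here are made up)
file_to_experiment = pd.Series(
    {'FILE_A': 'EXP_1', 'FILE_B': 'EXP_1', 'FILE_C': 'EXP_2', 'FILE_D': 'EXP_9'},
    name='Experiment accession',
)
experiment_to_stage = pd.Series({'EXP_1': 'E11.5', 'EXP_2': 'E13.5'}, name='Stage')

# map() looks up each value in the index of the Series passed to it
stage = file_to_experiment.map(experiment_to_stage)

# EXP_9 is not in the lookup table, so FILE_D gets NaN instead of a stage
unmatched = stage[stage.isna()].index.tolist()
print(stage)
print('Files without a stage:', unmatched)
```

Checking `isna()` right after the mapping step is a cheap way to catch typos in a manually created table.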


### [OPTIONAL] Check some questions
- Examples of how to validate your metadata
- Demo of more complex logic checks using pandas functions

In [8]:
print('Number of tissues', 
      metadata_selected['Biosample term name'].unique().size)
metadata_selected['Biosample term name'].unique()

Number of tissues 1


array(['forebrain'], dtype=object)

In [9]:
print('Number of time points in each tissue')
metadata_selected.groupby('Biosample term name')['Stage'].apply(lambda i: i.unique().size)

Number of time points in each tissue


Biosample term name
forebrain    8
Name: Stage, dtype: int64

In [10]:
print('Does each experiment (tissue*stage) have 2 replicates?')

# In plain words: the number of distinct replicates in every tissue and stage combination equals 2.
# This is just a demo of how complex validation logic can be expressed with pandas.
sum(
    metadata_selected\
    .groupby(['Biosample term name', 'Stage'])\
    .apply(lambda i: i['Biological replicate(s)'].unique().size) != 2
) == 0

Does each experiment (tissue*stage) have 2 replicates?


True
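
The same check can also be written with `nunique()` and `.all()`, which some people find easier to read. This is only an alternative sketch on a small hand-made table (the `toy` frame and the variable names are assumptions for illustration), not a replacement for the cell above:

```python
import pandas as pd

# Small hand-made table with the same column names as metadata_selected
toy = pd.DataFrame({
    'Biosample term name': ['forebrain'] * 5,
    'Stage': ['E11.5', 'E11.5', 'E11.5', 'E13.5', 'E13.5'],
    'Biological replicate(s)': [1, 2, 2, 2, 1],
})

# Number of distinct replicates in each tissue/stage combination
replicates_per_group = toy.groupby(
    ['Biosample term name', 'Stage']
)['Biological replicate(s)'].nunique()

# True only if every combination has exactly 2 distinct replicates
all_have_two = bool((replicates_per_group == 2).all())

# If the check fails, this shows which combinations are the problem
bad_groups = replicates_per_group[replicates_per_group != 2]
print(all_have_two)
```

Keeping `bad_groups` around is useful because a plain `True`/`False` does not tell you where to look when the check fails.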

### Rename the files, add sample information

In [11]:
# create an output dir from the command line
!mkdir data/fastq

mkdir: data/fastq: File exists


In [12]:
# another way to create the directory, using pure Python
output_dir = pathlib.Path('data/fastq')
output_dir.mkdir(exist_ok=True)
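
Note the difference between the two cells: `!mkdir` complains when the directory already exists, while `exist_ok=True` makes the Python call safe to rerun. A minimal sketch, run inside a temporary directory so it does not touch your real data (the `nested_dir` name is only for illustration), that also uses `parents=True` to create missing parent directories:

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # 'data' does not exist yet inside the temporary directory
    nested_dir = pathlib.Path(tmp) / 'data' / 'fastq'

    # parents=True creates 'data' first; exist_ok=True allows reruns
    nested_dir.mkdir(parents=True, exist_ok=True)

    # Calling it a second time does nothing and raises no error
    nested_dir.mkdir(parents=True, exist_ok=True)

    created = nested_dir.is_dir()

print(created)
```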

In [13]:
# rename the columns
rename_dict = {
    'Output type': 'count_type',
    'Experiment accession': 'experiment_id',
    'Biosample term id': 'bio_sample_id',
    'Biosample term name': 'tissue',
    'Biological replicate(s)': 'replicate',
    'Stage': 'dev_time'
}
metadata_selected.rename(columns=rename_dict, inplace=True)
metadata_selected.head()

File accession,count_type,experiment_id,bio_sample_id,tissue,replicate,dev_time
ENCFF329ACL,reads,ENCSR160IIN,UBERON:0001890,forebrain,1,E11.5
ENCFF251LNG,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5
ENCFF896COV,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5
ENCFF959PSX,reads,ENCSR970EWM,UBERON:0001890,forebrain,2,E13.5
ENCFF235DNM,reads,ENCSR970EWM,UBERON:0001890,forebrain,1,E13.5


In [14]:
# subprocess creates a sub-process to execute a shell command
import subprocess

In [15]:
output_path_records = []
for file_id, row in metadata_selected.iterrows():
    tissue, dev_time, rep = row[['tissue', 'dev_time', 'replicate']]
    
    # assemble the input file path and make it absolute
    input_path = pathlib.Path(f'./data/small_fastq/{file_id}.fastq.gz').absolute()
    
    # assemble the output file path
    output_path = f'./data/fastq/{tissue}_{dev_time}_{rep}_{file_id}.fastq.gz'.replace(' ', '')
    
    # save the output file path
    output_path_records.append(pathlib.Path(output_path).name)
    
    # and create the soft link
    subprocess.run(['ln', '-s', input_path, output_path])
    
# add soft-link names to the metadata table
metadata_selected['file_name'] = output_path_records
metadata_selected.head()

File accession,count_type,experiment_id,bio_sample_id,tissue,replicate,dev_time,file_name
ENCFF329ACL,reads,ENCSR160IIN,UBERON:0001890,forebrain,1,E11.5,forebrain_E11.5_1_ENCFF329ACL.fastq.gz
ENCFF251LNG,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5,forebrain_E11.5_2_ENCFF251LNG.fastq.gz
ENCFF896COV,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5,forebrain_E11.5_2_ENCFF896COV.fastq.gz
ENCFF959PSX,reads,ENCSR970EWM,UBERON:0001890,forebrain,2,E13.5,forebrain_E13.5_2_ENCFF959PSX.fastq.gz
ENCFF235DNM,reads,ENCSR970EWM,UBERON:0001890,forebrain,1,E13.5,forebrain_E13.5_1_ENCFF235DNM.fastq.gz
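
The soft links can also be created without calling `ln` through `subprocess`. A minimal sketch using `Path.symlink_to`, run inside a temporary directory; the helper name `link_with_name` and the file ID `ENCFF000XXX` are hypothetical and only serve the example:

```python
import pathlib
import tempfile


def link_with_name(source, link_path):
    """Create a soft link at link_path pointing to source.

    Returns True if the link was created and False if something
    already exists at link_path (so the cell can be rerun safely).
    """
    source = pathlib.Path(source).resolve()
    link_path = pathlib.Path(link_path)
    if link_path.is_symlink() or link_path.exists():
        return False
    link_path.symlink_to(source)
    return True


with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)

    # Hypothetical original file, named by its database ID
    original = tmp / 'ENCFF000XXX.fastq.gz'
    original.write_bytes(b'')

    # Meaningful name: tissue, stage, replicate, then the original ID
    link = tmp / 'forebrain_E11.5_1_ENCFF000XXX.fastq.gz'

    first = link_with_name(original, link)   # creates the link
    second = link_with_name(original, link)  # link exists, nothing happens
    points_to_original = link.resolve() == original.resolve()

print(first, second, points_to_original)
```

Checking for an existing link first avoids the "File exists" error you would get from running `ln -s` twice on the same output path.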


In [16]:
# save the metadata table next to the FASTQ soft links
metadata_selected.to_csv('data/fastq/fastq_metadata.csv')
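
When you load this table again in a later notebook, pass `index_col=0` so that the file accession becomes the index again. A minimal sketch of the round trip, assuming a two-row stand-in table and an in-memory buffer instead of a real file (the IDs `FILE_A` and `FILE_B` are made up):

```python
import io

import pandas as pd

# Two-row stand-in for metadata_selected
toy = pd.DataFrame(
    {
        'tissue': ['forebrain', 'forebrain'],
        'dev_time': ['E11.5', 'E13.5'],
        'replicate': [1, 2],
    },
    index=pd.Index(['FILE_A', 'FILE_B'], name='File accession'),
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the first column as the index
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded)
```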

## Output of this notebook

**In the ./data/fastq/ dir**

1. We have soft links with meaningful names to the 25 original FASTQ files (which are named by their database IDs).
2. We have a clean metadata table recording all sample and database ID information for these 25 files.

In [17]:
!ls -hl ./data/fastq/

total 8
-rw-r--r--@   1 hq  staff   2.6K May 31 23:13 fastq_metadata.csv
lrwxr-xr-x    1 hq  staff   101B May 20 14:11 forebrain_E10.5_1_ENCFF320FJX.fastq.gz -> /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/small_fastq/ENCFF320FJX.fastq.gz
lrwxr-xr-x    1 hq  staff   101B May 20 14:11 forebrain_E10.5_1_ENCFF920CNZ.fastq.gz -> /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/small_fastq/ENCFF920CNZ.fastq.gz
lrwxr-xr-x    1 hq  staff   101B May 20 14:11 forebrain_E10.5_2_ENCFF528EVC.fastq.gz -> /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/small_fastq/ENCFF528EVC.fastq.gz
lrwxr-xr-x    1 hq  staff   101B May 20 14:11 forebrain_E10.5_2_ENCFF663SNC.fastq.gz -> /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/small_fastq/ENCFF663SNC.fastq.gz
lrwxr-xr-x    1 hq  staff   101B May 20 14:11 forebrain_E11.5_1_ENCFF329ACL.fastq.gz -> /Users/hq/Document

In [18]:
metadata_selected

File accession,count_type,experiment_id,bio_sample_id,tissue,replicate,dev_time,file_name
ENCFF329ACL,reads,ENCSR160IIN,UBERON:0001890,forebrain,1,E11.5,forebrain_E11.5_1_ENCFF329ACL.fastq.gz
ENCFF251LNG,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5,forebrain_E11.5_2_ENCFF251LNG.fastq.gz
ENCFF896COV,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5,forebrain_E11.5_2_ENCFF896COV.fastq.gz
ENCFF959PSX,reads,ENCSR970EWM,UBERON:0001890,forebrain,2,E13.5,forebrain_E13.5_2_ENCFF959PSX.fastq.gz
ENCFF235DNM,reads,ENCSR970EWM,UBERON:0001890,forebrain,1,E13.5,forebrain_E13.5_1_ENCFF235DNM.fastq.gz
ENCFF270GKY,reads,ENCSR185LWM,UBERON:0001890,forebrain,1,E14.5,forebrain_E14.5_1_ENCFF270GKY.fastq.gz
ENCFF460TCF,reads,ENCSR185LWM,UBERON:0001890,forebrain,1,E14.5,forebrain_E14.5_1_ENCFF460TCF.fastq.gz
ENCFF126IRS,reads,ENCSR185LWM,UBERON:0001890,forebrain,2,E14.5,forebrain_E14.5_2_ENCFF126IRS.fastq.gz
ENCFF748SRJ,reads,ENCSR185LWM,UBERON:0001890,forebrain,2,E14.5,forebrain_E14.5_2_ENCFF748SRJ.fastq.gz
ENCFF447EXU,reads,ENCSR362AIZ,UBERON:0001890,forebrain,2,P0,forebrain_P0_2_ENCFF447EXU.fastq.gz


## Homework

Learn the Python pathlib package on your own. (Hint: the most important class in this package is the Path class. You don't need to learn everything about the whole package; try searching for a simple tutorial on Google.)
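
As a starting point, here is a minimal sketch of a few `Path` attributes, applied to one of the soft-link names created in this notebook. The path is only parsed as text here, so the file does not need to exist on your machine:

```python
import pathlib

# One of the soft-link paths created above (nothing is read from disk)
path = pathlib.Path('data/fastq/forebrain_E11.5_1_ENCFF329ACL.fastq.gz')

print(path.name)         # file name without the directories
print(path.parent.name)  # name of the directory that holds the file
print(path.suffix)       # last extension only

# The sample information can be recovered from the file name itself
tissue, dev_time, rep, rest = path.name.split('_')
file_id = rest.split('.')[0]
print(tissue, dev_time, rep, file_id)
```

Because the stage itself contains a dot (for example `E11.5`), splitting on `_` first and handling the extension afterwards is safer than relying on the dots in the file name.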