# Salmon Mapping

## Aim of this notebook
1. Use salmon to quantify all 25 trimmed FASTQ files, which belong to 16 samples
2. Generate a metadata table for the salmon quant tables (16 samples)

## Prepare mapping commands


In [1]:
import pandas as pd
import pathlib

In [2]:
fastq_meta = pd.read_csv('./data/fastq/trimmed/trimmed_fastq_metadata.csv', index_col=0)
fastq_meta

File accession,count_type,experiment_id,bio_sample_id,tissue,replicate,dev_time,file_name
ENCFF329ACL,reads,ENCSR160IIN,UBERON:0001890,forebrain,1,E11.5,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
ENCFF251LNG,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
ENCFF896COV,reads,ENCSR160IIN,UBERON:0001890,forebrain,2,E11.5,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
ENCFF959PSX,reads,ENCSR970EWM,UBERON:0001890,forebrain,2,E13.5,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
ENCFF235DNM,reads,ENCSR970EWM,UBERON:0001890,forebrain,1,E13.5,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
ENCFF270GKY,reads,ENCSR185LWM,UBERON:0001890,forebrain,1,E14.5,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
ENCFF460TCF,reads,ENCSR185LWM,UBERON:0001890,forebrain,1,E14.5,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
ENCFF126IRS,reads,ENCSR185LWM,UBERON:0001890,forebrain,2,E14.5,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
ENCFF748SRJ,reads,ENCSR185LWM,UBERON:0001890,forebrain,2,E14.5,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
ENCFF447EXU,reads,ENCSR362AIZ,UBERON:0001890,forebrain,2,P0,/Users/hq/Documents/pkg/py_genome_sci_book/ana...


## Prepare salmon command for each sample

In [3]:
# output dir
output_dir = pathlib.Path('data/quant/').absolute()
output_dir.mkdir(exist_ok=True)

# set all the directories
index_dir = pathlib.Path('data/salmon_index/').absolute()

In [4]:
# salmon accepts a threads parameter for parallel acceleration
# Change this number based on the number of cores in your computer
# also, because salmon runs in parallel internally, we just run the salmon commands one by one
threads = 4
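
If you would rather not hard-code the thread count, one option is to derive it from the machine. The snippet below is only a sketch: leaving one core free is a policy assumed here for illustration, not something salmon requires.

```python
import os

# os.cpu_count() may return None when the count cannot be determined,
# so fall back to 1 in that case.
available_cores = os.cpu_count() or 1

# Assumed policy (not from the notebook): leave one core free for the rest
# of the system, but always use at least one thread.
threads = max(1, available_cores - 1)
print(f'Using {threads} of {available_cores} cores')
```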

## Key step: merge FASTQ files from same sample together in one salmon command

There are 25 FASTQ files for 16 samples. Some samples (e.g. ENCSR160IIN Replicate 2) have multiple FASTQ files. When quantifying reads with salmon, you have to provide all FASTQ files of the same sample together in one command.

In the cell below, I use a key pandas method, pd.DataFrame.groupby(), to do this. It splits the metadata table into one sub-table per sample, so the FASTQ files of each sample can be collected together.
See the pandas documentation for more about groupby:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#splitting-an-object-into-groups

This is a very important and frequently used function.
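
To see what groupby does before applying it to the real metadata, here is a small sketch on a toy table. The file names and the `toy_meta` and `files_per_sample` names are made up for illustration; only the column names match this notebook.

```python
import pandas as pd

# Toy metadata with hypothetical file names; the columns mirror fastq_meta
toy_meta = pd.DataFrame({
    'tissue': ['forebrain'] * 4,
    'dev_time': ['E11.5', 'E11.5', 'E11.5', 'E13.5'],
    'replicate': [1, 2, 2, 1],
    'file_name': ['a.fq.gz', 'b.fq.gz', 'c.fq.gz', 'd.fq.gz'],
})

# Iterating over a groupby yields (group key, sub-table) pairs.
# With three grouping columns, the key is a 3-tuple.
files_per_sample = {}
for (tissue, time, rep), sub_df in toy_meta.groupby(['tissue', 'dev_time', 'replicate']):
    files_per_sample[f'{tissue}_{time}_{rep}'] = list(sub_df['file_name'])

for sample, files in files_per_sample.items():
    print(sample, files)
```

The sample with two rows (E11.5, replicate 2) ends up with both of its files in one list, which is exactly what is needed to build a single salmon command per sample.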

In [5]:
# make command for each RNA-seq sample based on the metadata
commands = {}
for (tissue, time, rep), sub_df in fastq_meta.groupby(['tissue', 'dev_time', 'replicate']):
    fastq_paths_str = ' '.join(sub_df['file_name'])
    output_name = output_dir / f'{tissue}_{time}_{rep}.quant'
    
    # assemble the final command
    command = f'salmon quant -i {index_dir} -l A -r {fastq_paths_str} --threads {threads} --validateMappings -o {output_name}'
    commands[f'{tissue}_{time}_{rep}'] = command

In [6]:
# an example command (the last one assembled in the loop above)
command

'salmon quant -i /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/salmon_index -l A -r /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/trimmed/forebrain_P0_2_ENCFF447EXU_trimmed.fq.gz /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/trimmed/forebrain_P0_2_ENCFF458NWF_trimmed.fq.gz --threads 4 --validateMappings -o /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/quant/forebrain_P0_2.quant'
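
The command above is a plain string, so a path containing a space would be split into two arguments by the shell. A possible safeguard, sketched below, is to build the argument list first and let `shlex.join` (Python 3.8+) do the quoting. The helper name `build_salmon_command` and the example paths are hypothetical; the salmon flags are the same ones used in the cell above.

```python
import shlex

def build_salmon_command(index_dir, fastq_paths, output_name, threads=4):
    """Return a shell-safe salmon quant command string (hypothetical helper)."""
    args = [
        'salmon', 'quant',
        '-i', str(index_dir),
        '-l', 'A',
        '-r', *[str(p) for p in fastq_paths],
        '--threads', str(threads),
        '--validateMappings',
        '-o', str(output_name),
    ]
    # shlex.join quotes every argument that needs it (e.g. paths with spaces)
    return shlex.join(args)

# Made-up paths, one of them containing spaces
demo_command = build_salmon_command(
    index_dir='/data/salmon_index',
    fastq_paths=['/data/my project/sample 1.fq.gz', '/data/sample2.fq.gz'],
    output_name='/data/quant/demo.quant',
)
print(demo_command)
```

Splitting the result with `shlex.split` gives back the original argument list, so the string can still be passed to `subprocess.run(..., shell=True)` as in this notebook.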

## Run salmon

In [7]:
import subprocess
for name, command in commands.items():
    # once a command has finished, write a flag file as a physical record that it completed
    # the flag file also prevents rerunning the command if the execution stopped partway through
    if (output_dir / name).exists():
        print('EXISTS', name)
        continue
    
    subprocess.run(command, shell=True, check=True, 
                   stdout=subprocess.PIPE, stderr=subprocess.PIPE, encoding='utf8')
    
    print('FINISH', name)
    with open(output_dir / name, 'w') as f:
        f.write('Oh Yeah')
    

EXISTS forebrain_E10.5_1
EXISTS forebrain_E10.5_2
EXISTS forebrain_E11.5_1
EXISTS forebrain_E11.5_2
EXISTS forebrain_E12.5_1
EXISTS forebrain_E12.5_2
EXISTS forebrain_E13.5_1
EXISTS forebrain_E13.5_2
EXISTS forebrain_E14.5_1
EXISTS forebrain_E14.5_2
EXISTS forebrain_E15.5_1
EXISTS forebrain_E15.5_2
EXISTS forebrain_E16.5_1
EXISTS forebrain_E16.5_2
EXISTS forebrain_P0_1
EXISTS forebrain_P0_2
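
The flag-file pattern used above can be isolated into a small function. This is a minimal sketch under some assumptions: `run_once` is a hypothetical name, the command is passed as an argument list rather than a shell string, and a harmless Python one-liner stands in for salmon so the example runs anywhere.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_once(name, args, flag_dir):
    """Run args unless a flag file called `name` exists; return True if it ran."""
    flag_path = Path(flag_dir) / name
    if flag_path.exists():
        return False
    # check=True raises CalledProcessError on failure, so the flag file
    # is only written after a successful run
    subprocess.run(args, check=True,
                   stdout=subprocess.PIPE, stderr=subprocess.PIPE, encoding='utf8')
    flag_path.write_text('done')
    return True

with tempfile.TemporaryDirectory() as tmp:
    demo_args = [sys.executable, '-c', 'pass']  # stand-in for a salmon command
    first = run_once('demo', demo_args, tmp)    # runs and writes the flag
    second = run_once('demo', demo_args, tmp)   # skipped, flag already exists
print(first, second)
```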


## Clean up the flag files

In [8]:
# optional, delete the flag files
# for name in commands.keys():
#     subprocess.run(f'rm {output_dir / name}', shell=True)
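
As an alternative to calling `rm` through the shell, the flag files can be removed with pathlib alone. The sketch below uses a hypothetical helper, `remove_flags`, and a temporary directory so it does not touch real data.

```python
import tempfile
from pathlib import Path

def remove_flags(flag_dir, names):
    """Delete the flag file for each name if present; return the names removed."""
    removed = []
    for name in names:
        flag_path = Path(flag_dir) / name
        # is_file() keeps the salmon output directories (name + '.quant') safe
        if flag_path.is_file():
            flag_path.unlink()
            removed.append(name)
    return removed

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'forebrain_P0_1').write_text('done')   # flag file
    (Path(tmp) / 'forebrain_P0_1.quant').mkdir()        # salmon output dir
    removed = remove_flags(tmp, ['forebrain_P0_1', 'forebrain_P0_2'])
    output_dir_kept = (Path(tmp) / 'forebrain_P0_1.quant').is_dir()
print(removed, output_dir_kept)
```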

## Make a metadata table for the salmon output

In [9]:
# find all the salmon quant.sf files, make a list
salmon_quant_list = list(output_dir.glob('**/quant.sf'))
salmon_quant_list[:5]

[PosixPath('/Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/quant/forebrain_E13.5_2.quant/quant.sf'),
 PosixPath('/Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/quant/forebrain_P0_2.quant/quant.sf'),
 PosixPath('/Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/quant/forebrain_E12.5_1.quant/quant.sf'),
 PosixPath('/Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/quant/forebrain_E14.5_1.quant/quant.sf'),
 PosixPath('/Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/quant/forebrain_E15.5_2.quant/quant.sf')]

In [10]:
pd.read_csv(salmon_quant_list[0], nrows=10, sep='\t', index_col=0)

Name,Length,EffectiveLength,TPM,NumReads
ENSMUST00000193812.1,1070,821.0,0.0,0.0
ENSMUST00000082908.1,110,4.749,0.0,0.0
ENSMUST00000162897.1,4153,3904.0,0.0,0.0
ENSMUST00000159265.1,2989,2740.0,0.0,0.0
ENSMUST00000070533.4,3634,3385.0,0.0,0.0
ENSMUST00000192857.1,480,231.0,0.0,0.0
ENSMUST00000195335.1,2819,2570.0,0.0,0.0
ENSMUST00000192336.1,2233,1984.0,0.0,0.0
ENSMUST00000194099.1,2309,2060.0,0.0,0.0
ENSMUST00000161581.1,250,20.633,0.0,0.0


In [11]:
records = []
for path in salmon_quant_list:
    # parse all sample information from the output directory name
    # In further analysis, we don't necessarily need the database IDs anymore,
    # so here I chose to keep the metadata minimal
    # [:-6] strips the trailing '.quant' from the directory name
    sample_id = path.parent.name[:-6]
    tissue, time, rep = sample_id.split('_')
    records.append([sample_id, tissue, time, rep, str(path)])

# generate the metadata table and save it
salmon_metadata = pd.DataFrame(records, 
                               columns=['sample_id', 'tissue', 'dev_time', 'replicate', 'salmon_count_path']
                              ).set_index('sample_id')
salmon_metadata.to_csv('data/quant/salmon_metadata.csv')
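
The parsing step can also be written without the magic number `[:-6]`. The sketch below uses `.stem` on the parent directory to drop the `.quant` suffix; `parse_quant_path` is a hypothetical helper, and unlike the cell above it converts the replicate to an integer. It assumes that tissue, time and replicate contain no underscores themselves.

```python
from pathlib import PurePosixPath

def parse_quant_path(path):
    """Split '<tissue>_<time>_<rep>.quant/quant.sf' into its sample fields."""
    # .stem removes only the last suffix, so 'forebrain_E13.5_2.quant'
    # becomes 'forebrain_E13.5_2' (the '.5' inside the name is kept)
    sample_id = PurePosixPath(path).parent.stem
    tissue, time, rep = sample_id.split('_')
    return sample_id, tissue, time, int(rep)

parsed = parse_quant_path('data/quant/forebrain_E13.5_2.quant/quant.sf')
print(parsed)
```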

## Output of this notebook
In the ./data/quant/ directory:

1. We have one subdirectory for each sample generated by salmon quant; within each, the quant.sf file holds that sample's quantification

In [12]:
# tree is a command that pretty-prints a directory structure
# if you don't have this command, search for how to install tree on your system
# or just use the ls command
!tree ./data/quant/

./data/quant/
├── forebrain_E10.5_1
├── forebrain_E10.5_1.quant
│   ├── aux_info
│   │   ├── ambig_info.tsv
│   │   ├── expected_bias.gz
│   │   ├── fld.gz
│   │   ├── meta_info.json
│   │   ├── observed_bias.gz
│   │   └── observed_bias_3p.gz
│   ├── cmd_info.json
│   ├── libParams
│   │   └── flenDist.txt
│   ├── lib_format_counts.json
│   ├── logs
│   │   └── salmon_quant.log
│   └── quant.sf
├── forebrain_E10.5_2
├── forebrain_E10.5_2.quant
│   ├── aux_info
│   │   ├── ambig_info.tsv
│   │   ├── expected_bias.gz
│   │   ├── fld.gz
│   │   ├── meta_info.json
│   │   ├── observed_bias.gz
│   │   └── observed_bias_3p.gz
│   ├── cmd_info.json
│   ├── libParams
│   │   └── flenDist.txt
│   ├── lib_format_counts.json
│   ├── logs
│   │   └── salmon_quant.log
│   └── quant.sf
├── forebrain_E11.5_1
├── forebrain_E11.5_1.quant
│   ├── aux_info
│   │   ├── ambig_info.tsv
│   │   ├── expected_bias.gz
│   │   ├── fld.gz
│   │   ├── meta_info.json
│   │  

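2. We have a metadata table (data/quant/salmon_metadata.csv) that records the quant.sf path of each sample

In [13]: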
salmon_metadata

sample_id,tissue,dev_time,replicate,salmon_count_path
forebrain_E13.5_2,forebrain,E13.5,2,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
forebrain_P0_2,forebrain,P0,2,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
forebrain_E12.5_1,forebrain,E12.5,1,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
forebrain_E14.5_1,forebrain,E14.5,1,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
forebrain_E15.5_2,forebrain,E15.5,2,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
forebrain_E12.5_2,forebrain,E12.5,2,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
forebrain_P0_1,forebrain,P0,1,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
forebrain_E13.5_1,forebrain,E13.5,1,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
forebrain_E15.5_1,forebrain,E15.5,1,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
forebrain_E14.5_2,forebrain,E14.5,2,/Users/hq/Documents/pkg/py_genome_sci_book/ana...
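
One way a table like this might be used later is to collect the TPM column of every quant.sf into a single transcript-by-sample matrix. The following is only a sketch: `load_tpm_matrix` is a hypothetical helper, and the two tiny quant.sf files are synthetic stand-ins written to a temporary directory so the example is self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_tpm_matrix(metadata):
    """Read the TPM column of each quant.sf listed in metadata['salmon_count_path']."""
    columns = {}
    for sample_id, path in metadata['salmon_count_path'].items():
        quant = pd.read_csv(path, sep='\t', index_col=0)
        columns[sample_id] = quant['TPM']
    # rows: transcripts, columns: samples (aligned on transcript name)
    return pd.DataFrame(columns)

with tempfile.TemporaryDirectory() as tmp:
    records = []
    # synthetic values, not real salmon output
    for sample_id, tpms in {'s1': [1.0, 2.0], 's2': [3.0, 4.0]}.items():
        quant_dir = Path(tmp) / f'{sample_id}.quant'
        quant_dir.mkdir()
        toy_quant = pd.DataFrame({
            'Name': ['tx1', 'tx2'],
            'Length': [100, 200],
            'EffectiveLength': [50.0, 150.0],
            'TPM': tpms,
            'NumReads': [0.0, 0.0],
        })
        toy_quant.to_csv(quant_dir / 'quant.sf', sep='\t', index=False)
        records.append([sample_id, str(quant_dir / 'quant.sf')])
    toy_metadata = pd.DataFrame(
        records, columns=['sample_id', 'salmon_count_path']
    ).set_index('sample_id')
    tpm_matrix = load_tpm_matrix(toy_metadata)

print(tpm_matrix)
```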
