# Data Preparation: Ribo-seq Reads

## Clipping Reads to Remove Sequencing Adapters

For every replicate, define the input path and the output path needed to run the clipper. All tasks are submitted with qsub at the same time.

In [12]:
%%bash

# Edit manifest file to trim every replicate of each sample
manifest='../../../Scripts/1_Data_Preparation/Ribo_seq/Clipping/manifest.ini'
# Set path to save logs 
logs='.../logs'

sh ../../../Scripts/1_Data_Preparation/Ribo_seq/Clipping/runDataPreparation.sh "$manifest" "$logs"


## UMI Barcode Detection: UMI Whitelist

Unique Molecular Identifiers (UMIs) are used to find and remove PCR duplicates.

In [4]:
%%bash
# Identify UMI barcodes and write one whitelist per clipped FASTQ file

nameTask='UMI_' # Prefix of the qsub task name
saveOutputQsub='.../qsub_outputs' # Path to save qsub outputs
logPath='.../logs' # Path to save logs

masterDirectory='.../Data_Preparation/Ribo_Seq_Data/ClippedData/' # Directory containing the clipped reads

for entry in "$masterDirectory"*
do
    if [[ $entry == *"fastq"* ]]; then
        fastqName=${entry##*/}      # File name without its directory
        sampleName=${fastqName%%.*} # Text before the first '.'
        newNameTask=$nameTask$sampleName
        outPutWhiteList=$masterDirectory${newNameTask}_whitelist.txt
        sh ../../../Scripts/1_Data_Preparation/Ribo_seq/UMI/UMI_WhiteList.sh "$entry" "$outPutWhiteList" "$newNameTask" "$saveOutputQsub" "$logPath"
    fi
done


## UMI extraction

In [6]:
%%bash
# Remove the UMI barcodes from the reads, using the whitelist from the previous step

nameTask='UMI_Extract_' # Prefix of the qsub task name
saveOutputQsub='.../qsub_outputs' # Path to save qsub outputs
logPath='.../logs' # Path to save logs

masterDirectory='.../Data_Preparation/Ribo_Seq_Data/ClippedData/' # Directory containing the clipped reads

for entry in "$masterDirectory"*
do
    if [[ $entry == *"fastq"* ]]; then
        fastqName=${entry##*/}      # File name without its directory
        sampleName=${fastqName%%.*} # Text before the first '.'
        newNameTask=$nameTask$sampleName
        oldNameTask=UMI_$sampleName # Task name used in the whitelist step
        outPutWhiteList=$masterDirectory${oldNameTask}_whitelist.txt
        outPutFastq=$masterDirectory${sampleName}.UMI.fastq
        sh ../../../Scripts/1_Data_Preparation/Ribo_seq/UMI/UMI_Extract.sh "$entry" "$outPutFastq" "$outPutWhiteList" "$newNameTask" "$saveOutputQsub" "$logPath"
    fi
done
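
The two loops above derive every output name from the name of the input FASTQ file. The following is a minimal sketch of that derivation for a single file; the function name `derive_umi_paths` and the `/tmp/ClippedData/` path are hypothetical and are not part of the pipeline scripts.

```bash
# Hypothetical helper (not part of the pipeline scripts): print the whitelist
# path and the UMI-extracted FASTQ path that the loops above derive for one
# input file.
derive_umi_paths() {
    entry=$1                      # e.g. <masterDirectory><sample>.fastq
    masterDirectory=$2            # must end with '/', as in the cells above
    fastqName=${entry##*/}        # strip the directory part
    sampleName=${fastqName%%.*}   # keep the text before the first '.'
    printf '%s\n' "${masterDirectory}UMI_${sampleName}_whitelist.txt"
    printf '%s\n' "${masterDirectory}${sampleName}.UMI.fastq"
}

# Placeholder path, for illustration only
derive_umi_paths '/tmp/ClippedData/Clipped_sampleA.fastq' '/tmp/ClippedData/'
```

Because the loops match any file name containing `fastq`, a file such as `<sample>.UMI.fastq` written back into the same directory would also match if the cell is run a second time.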

## Filtering By Length

After the steps above, we keep only the ribosome profiling reads that are 26 to 34 bp long.

This must be done for each sample, for both the Ribo-TIS and the Ribo-Elongation FASTQ reads.

Input Files
```bash
"""
input : path
    Path to the clipped, UMI-extracted file Clipped_{NAME_SAMPLE}.UMI.fastq; must include the file name corresponding to NAME_SAMPLE
output : path
    Path to save the filtered file; must include the output file name corresponding to NAME_SAMPLE
qsubPath : path
    Path to save qsub output
logPath : path
    Path to save log output
"""
```
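
The length filter described above can be sketched with `awk`. This is an illustration only: it assumes 4-line FASTQ records and inclusive bounds, and the function name `filter_fastq_by_length` is hypothetical. The pipeline's own filtering script may behave differently.

```bash
# Sketch only: keep FASTQ records whose sequence length is within [min, max].
# Assumes 4-line FASTQ records and inclusive bounds; the pipeline's own
# filtering script may behave differently.
filter_fastq_by_length() {
    awk -v min="$1" -v max="$2" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            n = length(seq)
            if (n >= min && n <= max) {
                print header; print seq; print plus; print $0
            }
        }
    '
}

# Usage with the bounds stated above (file names are placeholders):
# filter_fastq_by_length 26 34 < input.fastq > output.fastq
```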


In [7]:
%%bash

input='.../Data_Preparation/Ribo_Seq_Data/ClippedData/Clipped_{NAME_SAMPLE}.UMI.fastq'
output='.../Data_Preparation/Ribo_Seq_Data/Filtered_by_Length_Data/Filtered_reads{NAME_SAMPLE}.fastq'
qsubPath='.../qsub_outputs'
logPath='.../logs'

sh ../../../Scripts/1_Data_Preparation/Ribo_seq/Filtering/0_toRunFilter.sh "$input" "$output" "$qsubPath" "$logPath"


## Alignment of Ribosome Profiling Reads

Alignment was done end-to-end using the STAR aligner against the human genome (hg38).

Alignment of the Ribo-TIS and Ribo-Elongation reads for a given sample.

Input Files
```bash
"""
genomeDirectory : path
    Path to the genome index
outputFilterMatchInt : int
    Minimum number of matched bases
inputFilesR1_1 : path
    Path to the file containing the input reads: replicate 1
inputFilesR1_2 : path
    Path to the file containing the input reads: replicate 2
outputFile : path
    Path where the output will be stored
anchorMM : int
    Maximum number of loci that anchors are allowed to map to
nameTask : string
    Name of the qsub task, used to identify it
saveOutputQsub : path
    Path to save the qsub output
logPath : path
    Path to save the log output
"""
```
Output Files
```bash
"""
    Any STAR alignment output will be stored in $outputFile
    e.g. BAM file
"""
```
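
As an illustration of how the parameters above could translate into a STAR call, the sketch below only prints a command line and does not run STAR. The mapping of each variable to a STAR option is an assumption based on the parameter descriptions; `alignmentSingleEnd_Ribo_Seq.sh` may use different or additional options. The function name `print_star_command` and the example arguments are hypothetical.

```bash
# Sketch only: print a STAR command line matching the parameters described
# above. The variable-to-option mapping is an assumption; the pipeline's
# alignment script may use different or additional options.
print_star_command() {
    genomeDirectory=$1; outputFilterMatchInt=$2
    inputFilesR1_1=$3; inputFilesR1_2=$4
    outputFile=$5; anchorMM=$6
    echo STAR \
        --genomeDir "$genomeDirectory" \
        --readFilesIn "${inputFilesR1_1},${inputFilesR1_2}" \
        --alignEndsType EndToEnd \
        --outFilterMatchNmin "$outputFilterMatchInt" \
        --winAnchorMultimapNmax "$anchorMM" \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix "$outputFile"
}

# Placeholder arguments, for illustration only
print_star_command genome_index/ 25 rep1.fastq rep2.fastq out/TIS/ 150
```

Passing both replicates as a comma-separated list to `--readFilesIn` aligns them as a single read set, which is consistent with the single `$outputFile` location used here.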


In [13]:
%%bash
outputFilterMatchInt=25
anchorMM=150

echo "Alignment to The genome -- RiboTis"
genomeDirectory='../../../Data_Input_Scripts/IndexStarRibosomeProfiling/'
inputFilesR1_1='.../Data_Preparation/Ribo_Seq_Data/Filtered_by_Length_Data/....fastq'
inputFilesR1_2='.../Data_Preparation/Ribo_Seq_Data/Filtered_by_Length_Data/....fastq'
outputFile='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/TIS/'
nameTask='Mapping_RiboTis_reads'
saveOutputQsub='.../qsub_outputs'
logPath='.../logs'

sh ../../../Scripts/1_Data_Preparation/Ribo_seq/Alignment/alignmentSingleEnd_Ribo_Seq.sh "$genomeDirectory" "$outputFilterMatchInt" "$inputFilesR1_1" "$inputFilesR1_2" "$outputFile" "$anchorMM" "$nameTask" "$saveOutputQsub" "$logPath"

echo "Alignment to The genome -- RiboElong"
genomeDirectory='../../../Data_Input_Scripts/IndexStarRibosomeProfiling/'
inputFilesR1_1='.../Data_Preparation/Ribo_Seq_Data/Filtered_by_Length_Data/....fastq'
inputFilesR1_2='.../Data_Preparation/Ribo_Seq_Data/Filtered_by_Length_Data/....fastq'
outputFile='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/Elong/'
nameTask='Mapping_RiboElong_reads'
saveOutputQsub='.../qsub_outputs'
logPath='.../logs'

sh ../../../Scripts/1_Data_Preparation/Ribo_seq/Alignment/alignmentSingleEnd_Ribo_Seq.sh "$genomeDirectory" "$outputFilterMatchInt" "$inputFilesR1_1" "$inputFilesR1_2" "$outputFile" "$anchorMM" "$nameTask" "$saveOutputQsub" "$logPath"


Alignment to The genome -- RiboTis
Alignment to The genome -- RiboElong


## Indexing the BAM File

Generate a sorted index for each BAM file.

Input Files
```bash
"""
inputFile : path
    Path to the BAM file
outputFile : path
    Path where the output index will be stored
nameTask : string
    Name of the qsub task
saveOutputQsub : path
    Path to save the qsub output
logPath : path
    Path to save the log output
"""
```
Output File
```bash
"""
    The .bai index will be stored in $outputFile
"""
```
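
For reference, indexing a BAM file with samtools can be sketched as follows. This assumes samtools is on `PATH` and that the BAM file is already coordinate-sorted; `creationIndexBamFile.sh` may do more, such as sorting first or submitting the job through qsub. The function name `index_bam` is hypothetical.

```bash
# Sketch only: index a coordinate-sorted BAM file with samtools.
# Assumes samtools is on PATH; the pipeline's own script may do more
# (for example, sort the BAM first or submit the job through qsub).
index_bam() {
    inputFile=$1    # coordinate-sorted BAM file
    outputFile=$2   # where to write the .bai index
    if [ ! -f "$inputFile" ]; then
        echo "BAM file not found: $inputFile" >&2
        return 1
    fi
    samtools index "$inputFile" "$outputFile"
}

# Usage (file names are placeholders):
# index_bam aligned.bam aligned.bai
```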

In [9]:
%%bash

echo 'Index For Bam File Ribo Tis'
inputFile='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/TIS/...bam'
outputFile='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/TIS/...bai'
nameTask='IndexForBAMFile_RiboTis'
saveOutputQsub='.../qsub_outputs'
logPath='.../logs'

sh ../../../Scripts/1_Data_Preparation/Ribo_seq/Alignment/creationIndexBamFile.sh "$inputFile" "$outputFile" "$nameTask" "$saveOutputQsub" "$logPath"

echo 'Index For Bam File Ribo Elong'
inputFile='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/Elong/...bam'
outputFile='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/Elong/...bai'
nameTask='IndexForBAMFile_RiboElongation'
saveOutputQsub='.../qsub_outputs'
logPath='.../logs'

sh ../../../Scripts/1_Data_Preparation/Ribo_seq/Alignment/creationIndexBamFile.sh "$inputFile" "$outputFile" "$nameTask" "$saveOutputQsub" "$logPath"


Index For Bam File Ribo Tis
Index For Bam File Ribo Elong


## Deduplication of the BAM File

Remove PCR duplicates based on their UMIs, using UMI-tools.

Input Files
```bash
"""
saveOutputQsub : path
    Path to save qsub output
logPath : path
    Path to save log output
entry : path
    Path to BAM file that will be deduplicated
outPutBamFile : path
    Path where the new deduplicated (DD) BAM file will be stored
outPutFolder : path
    Path to save additional output files from UMI-tools
nameTask : string
    qsub name task
"""
```
Output File
```bash
"""
    The deduplicated BAM file will be stored in $outPutBamFile
"""
```
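
The idea behind UMI-based deduplication can be illustrated on a toy table. The sketch below is conceptual and is not what UMI-tools runs internally: it treats reads with the same mapping position and the same UMI as PCR duplicates and keeps the first one. The input format (tab-separated read_id, chromosome, position, strand, UMI) and the function name `dedup_by_position_and_umi` are assumptions made for this example.

```bash
# Conceptual sketch only, not what UMI-tools runs internally: reads with the
# same mapping position and the same UMI are treated as PCR duplicates, and
# only the first one is kept. Input is a hypothetical tab-separated table with
# the columns read_id, chromosome, position, strand, UMI.
dedup_by_position_and_umi() {
    awk -F '\t' '!seen[$2 FS $3 FS $4 FS $5]++'
}

# r1 and r2 share position and UMI, so r2 is dropped; r3 has a different UMI
# and r4 a different position, so both are kept.
printf 'r1\tchr1\t100\t+\tACGT\nr2\tchr1\t100\t+\tACGT\nr3\tchr1\t100\t+\tTTGA\nr4\tchr1\t250\t+\tACGT\n' |
    dedup_by_position_and_umi
```

Unlike this exact-match sketch, UMI-tools by default also groups UMIs that differ only by likely sequencing errors, so its results on real data can differ.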

In [10]:
%%bash

echo 'Deduplication Ribo TIS'
saveOutputQsub='.../qsub_outputs'
logPath='.../logs'
entry='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/TIS/'
outPutBamFile='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/TIS/...DD.bam'
outPutFolder='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/TIS/'
nameTask='Deduplication_Ribo_TIS'

sh ../../../Scripts/1_Data_Preparation/Ribo_seq/Alignment/deduplication.sh "$entry" "$outPutBamFile" "$outPutFolder" "$nameTask" "$saveOutputQsub" "$logPath"

echo 'Deduplication Ribo Elong'

saveOutputQsub='.../qsub_outputs'
logPath='.../logs'
entry='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/Elong/'
outPutBamFile='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/Elong/...DD.bam'
outPutFolder='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/Elong/'
nameTask='Deduplication_Ribo_Elong'

sh ../../../Scripts/1_Data_Preparation/Ribo_seq/Alignment/deduplication.sh "$entry" "$outPutBamFile" "$outPutFolder" "$nameTask" "$saveOutputQsub" "$logPath"



Deduplication Ribo TIS
Deduplication Ribo Elong


## Indexing the Deduplicated BAM File

Input Files
```bash
"""
inputFile : path
    Path to the BAM file
outputFile : path
    Path where the output index will be stored
nameTask : string
    Name of the qsub task
saveOutputQsub : path
    Path to save the qsub output
logPath : path
    Path to save the log output
"""
```
Output File
```bash
"""
    The .bai index will be stored in $outputFile
"""
```

In [11]:
%%bash

echo 'Index For Bam File Ribo Tis'
inputFile='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/TIS/...DD.bam'
outputFile='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/TIS/...DD.bai'
nameTask='IndexForBAMFile_RiboTis'
saveOutputQsub='.../qsub_outputs'
logPath='.../logs'

sh ../../../Scripts/1_Data_Preparation/Ribo_seq/Alignment/creationIndexBamFile.sh "$inputFile" "$outputFile" "$nameTask" "$saveOutputQsub" "$logPath"

echo 'Index For Bam File Ribo Elong'
inputFile='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/Elong/...DD.bam'
outputFile='.../Data_Preparation/Ribo_Seq_Data/Alignment_Reads_Genome/Ribo/Elong/...DD.bai'
nameTask='IndexForBAMFile_RiboElongation'
saveOutputQsub='.../qsub_outputs'
logPath='.../logs'

sh ../../../Scripts/1_Data_Preparation/Ribo_seq/Alignment/creationIndexBamFile.sh "$inputFile" "$outputFile" "$nameTask" "$saveOutputQsub" "$logPath"


Index For Bam File Ribo Tis
Index For Bam File Ribo Elong
