# Normalization and phenotype table generation for splicingQTL analysis


## Methods

This notebook continues to use leafcutter and psichomics; see /molecular_phenotyles/calling/splicing_calling.ipynb for details.

In this part of the pipeline, raw data from leafcutter and psichomics are first converted to BED format. Quality control then removes features whose NA rate across samples exceeds a threshold (default 40%), replaces the remaining NAs with the mean of the observed values for that feature, and removes introns whose variation across samples falls below a minimum (default 0.001). Finally, quantile-quantile normalization is applied to the QC'd phenotype data.
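
The normalization step can be illustrated with a small stdlib-only sketch. It follows the same rank-to-quantile formula as the `qqnorm` function defined in the last step of this notebook (offset `a = 3/8` when `n <= 10`, otherwise `0.5`), but as an assumption for portability it swaps `scipy.stats.rankdata` and `norm.ppf` for a hand-written helper and `statistics.NormalDist`. The helper name `average_ranks` and the toy values are hypothetical.

```python
from statistics import NormalDist

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def qqnorm(values):
    """Map values to standard-normal quantiles according to their ranks."""
    n = len(values)
    a = 3.0 / 8.0 if n <= 10 else 0.5
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - a) / (n + 1.0 - 2.0 * a))
            for r in average_ranks(values)]

# Toy intron usage ratios (made up); the two tied values get the same quantile
ratios = [0.10, 0.40, 0.25, 0.90, 0.25]
normalized = qqnorm(ratios)
```

Because only ranks enter the transform, the output depends on the ordering of the input values and not on their scale.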

## Input

### `leafcutter`

The `sample_list_intron_usage_perind.counts.gz` file generated by the previous `splicing_calling.ipynb` workflow.

### `psichomics`

The `psichomics_raw_data.tsv` file generated by the previous `splicing_calling.ipynb` workflow.

## Output

### `leafcutter`

`{sample_list}` below refers to the name of the metadata file used as input in the previous step.

The main output is:

`{sample_list}_intron_usage_perind.counts.gz_raw_data.qqnorm.txt`: a merged table of normalized intron usage ratios for each sample, ready to be used as phenotype input for tensorQTL, in the following format:

```
#Chr       start        end        ID                                                      samp1 samp2 samp3 ...
chromosome intron_start intron_end {chr}:{intron_start}:{intron_end}:{cluster_id}_{strand} data  data  data  ...  
```

(The strand in the "ID" column is the junction strandedness computed by regtools in the previous workflow.)
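
As an illustration of the ID layout, here is a minimal sketch that splits an ID of the form `{chr}:{intron_start}:{intron_end}:{cluster_id}_{strand}` back into its parts. The class and function names are hypothetical, the example ID is made up, and the sketch assumes the strand is the text after the last underscore.

```python
from typing import NamedTuple

class IntronID(NamedTuple):
    chrom: str
    start: int
    end: int
    cluster_id: str
    strand: str

def parse_intron_id(intron_id: str) -> IntronID:
    """Split '{chr}:{start}:{end}:{cluster_id}_{strand}' into typed fields."""
    chrom, start, end, cluster = intron_id.split(":")
    # assumption: the strand is the suffix after the last underscore
    cluster_id, strand = cluster.rsplit("_", 1)
    return IntronID(chrom, int(start), int(end), cluster_id, strand)

# Made-up ID, for illustration only
example = parse_intron_id("chr1:17055:17233:clu_1_+")
```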


### `psichomics`

The main output is:

`psichomics_raw_data_bedded.qqnorm.txt`: a merged table of normalized percent spliced in (PSI) values for each sample, ready to be used as phenotype input for tensorQTL, in the following format:

```
#Chr       start        end        ID                                               samp1 samp2 samp3 ...
chromosome intron_start intron_end {event_type}_{chr}_{strand}_{coordinates}_{gene} data  data  data  ...  
```
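
Both output tables share the same four leading columns (`#Chr`, `start`, `end`, `ID`) followed by one column per sample. A quick sanity check of that layout could look like the sketch below; the function name `check_phenotype_table` and the toy table are hypothetical, and the check covers only the header, the field counts, and integer coordinates.

```python
import csv
import io

REQUIRED_COLUMNS = ["#Chr", "start", "end", "ID"]

def check_phenotype_table(handle):
    """Check the leading columns and row widths; return (sample_names, n_rows)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[:4] != REQUIRED_COLUMNS:
        raise ValueError(f"unexpected leading columns: {header[:4]}")
    n_rows = 0
    for row in reader:
        if len(row) != len(header):
            raise ValueError(
                f"data row {n_rows + 1} has {len(row)} fields, expected {len(header)}"
            )
        # start and end must parse as integer coordinates
        int(row[1]), int(row[2])
        n_rows += 1
    return header[4:], n_rows

# Made-up two-event table in the layout shown above
toy_table = (
    "#Chr\tstart\tend\tID\tsamp1\tsamp2\n"
    "chr1\t100\t200\tevent_1\t-0.43\t0.43\n"
    "chr1\t300\t450\tevent_2\t0.43\t-0.43\n"
)
samples, n_events = check_phenotype_table(io.StringIO(toy_table))
```

For a real output file, the same function can be called on an opened file handle instead of the `io.StringIO` wrapper.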


## Minimal working example


### For `leafcutter`
Run the files in this [google drive](https://drive.google.com/drive/folders/1lpcx3eKG2UpauntLUuJ6bMBjHyIhWW_R) folder through the workflow in /molecular_phenotyles/calling/splicing_calling.ipynb; the output file `sample_fastq_bam_list_intron_usage_perind.counts.gz` is the minimal working example input here.

In [None]:
sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
    --cwd output/ \
    --ratios output/sample_list_intron_usage_perind.counts.gz \
    --container containers/leafcutter.sif 

### For `psichomics`
Run the files in this [google drive](https://drive.google.com/drive/folders/1lpcx3eKG2UpauntLUuJ6bMBjHyIhWW_R) folder through the workflow in /molecular_phenotyles/calling/splicing_calling.ipynb; the output file `psi_raw_data.tsv` is the minimal working example input here.

In [None]:
sos run pipeline/splicing_normalization.ipynb psichomics_norm \
    --cwd output \
    --ratios input/psi_raw_data.tsv \
    --container containers/psichomics.sif

## Command interface

In [1]:
sos run splicing_normalization.ipynb -h

usage: sos run splicing_normalization.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  leafcutter_norm

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files.
  --ratios VAL (as path, required)
                        intron usage ratio file with samples after QC
  --job-size 1 (as int)
                        Raw data directory, default to the same directory as
                        sample list parameter: data_dir = path(f"{ratios:d}")
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                   

## Setup and global parameters

In [17]:
[global]
# The output directory for generated files. 
parameter: cwd = path("output")
# intron usage ratio file with samples after QC
parameter: ratios = path
# optional blacklist file listing chromosomes to exclude from the analysis
parameter: chr_blacklist = path(".")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
from sos.utils import expand_size
cwd = path(f'{cwd:a}')

## `leafcutter_norm`

Documentation: [`leafcutter`](https://davidaknowles.github.io/leafcutter/index.html). The choices of regtools parameters are [discussed here](https://github.com/davidaknowles/leafcutter/issues/127).


### Parameter Annotations

* `chr_blacklist`: a file listing chromosomes to exclude from the analysis, one per line. If none is provided, nothing is blacklisted.
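
A blacklist file is plain text with one chromosome name per line. The sketch below writes a small example file and reads it back in the same spirit as the `get_blacklist_chromosomes` helper in the workflow step; the function name `read_blacklist` and the chromosome names are illustrative assumptions. Names presumably need to match the chromosome prefix of the intron IDs (for example `chrX` rather than `X`), since the workflow compares them against the text before the first `:`.

```python
import os
import tempfile

def read_blacklist(path):
    """Return blacklisted chromosome names, or an empty list if the file is absent."""
    if not os.path.isfile(path):
        return []
    with open(path) as handle:
        # skip blank lines so a trailing newline does not add an empty name
        return [line.strip() for line in handle if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    blacklist_path = os.path.join(tmp, "black_list.txt")
    with open(blacklist_path, "w") as handle:
        handle.write("chrX\nchrY\nchrM\n")
    blacklisted = read_blacklist(blacklist_path)
    # a missing file means nothing is blacklisted
    missing = read_blacklist(os.path.join(tmp, "no_such_file.txt"))
```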

### Things to keep in mind

* `leafcutter_norm_1` appears to require about 10G of memory (more for large inputs); with less, it may fail with a segmentation fault.


In [2]:
[leafcutter_norm_1]
import os
if os.path.isfile(f'{ratios:dd}/black_list.txt'):
    chr_blacklist = f'{ratios:dd}/black_list.txt'
input: ratios, group_by = 'all'
output: f'{ratios}_phenotype_file_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
    # code in [leafcutter_norm_1] and [leafcutter_norm_3] is modified from 
    # https://github.com/davidaknowles/leafcutter/blob/master/scripts/prepare_phenotype_table.py
    import sys
    import gzip
    import os
    import numpy as np
    import pandas as pd
    import scipy as sc
    import pickle

    from sklearn import linear_model

    def stream_table(f, ss = ''):
        fc = '#'
        while fc[0] == "#":
            fc = f.readline().strip()
            head = fc.split(ss)

        for ln in f:
            ln = ln.strip().split(ss)
            attr = {}

            for i in range(len(head)):
                try: attr[head[i]] = ln[i]
                except: break
            yield attr

    def get_chromosomes(ratio_file):
        """Get chromosomes from table. Returns set of chromosome names"""
        try: open(ratio_file)
        except:
            sys.stderr.write("Can't find %s..exiting\n"%(ratio_file))
            return
        sys.stderr.write("Parsing chromosome names...\n")
        chromosomes = set()
        with gzip.open(ratio_file, 'rt') as f:
                f.readline()
                for line in f:
                    chromosomes.add(line.split(":")[0])
        return(chromosomes)

    def get_blacklist_chromosomes(chromosome_blacklist_file):
        """
        Get list of chromosomes to ignore from a file with one blacklisted
        chromosome per line. Returns list. eg. ['X', 'Y', 'MT']
        """

        if os.path.isfile(chromosome_blacklist_file):
            with open(chromosome_blacklist_file, 'r') as f:
                return(f.read().splitlines())
        else:
            return([])

    def create_phenotype_table(ratio_file, chroms, blacklist_chroms):
        dic_pop, fout = {}, {}
        try: open(ratio_file)
        except:
            sys.stderr.write("Can't find %s..exiting\n"%(ratio_file))
            return

        sys.stderr.write("Starting...\n")
        for i in chroms:
            fout[i] = open(ratio_file+".phen_"+i, 'w')
        # open the per-intron summary file once, not once per chromosome
        fout_ave = open(ratio_file+".ave", 'w')
        valRows, valRowsnn, geneRows = [], [], []
        finished = False
        header = gzip.open(ratio_file, 'rt').readline().split()[1:]

        for i in fout:
            fout[i].write("\t".join(["#chr","start", "end", "ID"]+header)+'\n')

        for dic in stream_table(gzip.open(ratio_file, 'rt'),' '):

            chrom = dic['chrom']
            chr_ = chrom.split(":")[0]
            if chr_ in blacklist_chroms: continue
            NA_indices, aveReads = [], []
            tmpvalRow = []

            for i, sample in enumerate(header):

                try: count = dic[sample]
                except: print([chrom, len(dic)])
                num, denom = count.split('/')
                if float(denom) < 1:
                    count = "NA"
                    tmpvalRow.append("NA")
                    NA_indices.append(i)
                else:
                    # add a 0.5 pseudocount
                    count = (float(num)+0.5)/((float(denom))+0.5)
                    tmpvalRow.append(count)
                    aveReads.append(count)

            chr_, s, e, clu = chrom.split(":")
            if len(tmpvalRow) > 0:
                fout[chr_].write("\t".join([chr_,s,e,chrom]+[str(x) for x in tmpvalRow])+'\n')
                fout_ave.write(" ".join(["%s"%chrom]+[str(min(aveReads)), str(max(aveReads)), str(np.mean(aveReads))])+'\n')

                valRows.append(tmpvalRow)
                geneRows.append("\t".join([chr_,s,e,chrom]))
                if len(geneRows) % 1000 == 0:
                    sys.stderr.write("Parsed %s introns...\n"%len(geneRows))

        for i in fout:
            fout[i].close()
        # flush and close the summary file as well
        fout_ave.close()

        matrix = np.array(valRows)

        # write the corrected tables

        sample_names = []

        for name in header:
            sample_names.append(name.replace('.Aligned.sortedByCoord.out.md', ''))

        fout = {}
        for i in chroms:
            fn="%s.qqnorm_%s"%(ratio_file,i)
            print("Outputting: " + fn)
            fout[i] = open(fn, 'w')
            fout[i].write("\t".join(['#Chr','start','end','ID'] + sample_names)+'\n')
        lst = []
        for i in range(len(matrix)):
            chrom, s = geneRows[i].split()[:2]

            lst.append((chrom, int(s), "\t".join([geneRows[i]] + [str(x) for x in  matrix[i]])+'\n'))

        lst.sort()
        for ln in lst:
            fout[ln[0]].write(ln[2])

        fout_run = open("%s_phenotype_file_list.txt"%ratio_file, 'w')

        fout_run.write("#chr\t#dir\n")

        for i in fout:
            fout[i].close()
            fout_run.write("%s\t"%(i))
            fout_run.write("%s.qqnorm_%s\n"%(ratio_file, i))
        fout_run.close()

    ratio_file = f'${_input}'
    chroms = get_chromosomes(f'${_input}')
    blacklist_chroms = get_blacklist_chromosomes(f'${chr_blacklist}')

    create_phenotype_table(ratio_file, chroms, blacklist_chroms)

In [3]:
[leafcutter_norm_2]
import pandas as pd
molecular_pheno_chr_inv = pd.read_csv(f'{_input[0]}',sep = "\t")
molecular_pheno_chr_inv = molecular_pheno_chr_inv.values.tolist()
file_inv = [x[1] for x in molecular_pheno_chr_inv]
input: file_inv # This design is necessary to avoid using for_each, as sos can not take chr number as an input.
output: f'{_input[0]:n}_raw_data.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container
    head -1 ${_input[0]:r}  > ${_output}
    cat ${_input:r} | grep -v "#Chr" >> ${_output}

## `psichomics_norm`

Documentation: [`psichomics`](http://bioconductor.org/packages/release/bioc/html/psichomics.html)

To retain as much information as possible, the only QC applied to PSI values here is NA removal and a minimal variance filter. The psichomics team has suggested further QC steps, which are discussed [here](https://github.com/nuno-agostinho/psichomics/issues/450).
For reference, the default minimal variance in leafcutter QC is 0.005.
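
The two filters and the mean imputation can be sketched on a toy input as follows. This is a stdlib-only illustration of the intended behaviour, not the workflow code: the function name `qc_rows`, the event names, and the values are hypothetical, and `None` stands in for NA. As in the workflow step, the threshold named `min_variance` is compared against a standard deviation.

```python
from statistics import mean, pstdev

def qc_rows(rows, na_rate=0.4, min_variance=0.001):
    """Filter and impute per-event lists of sample values (None marks NA)."""
    kept = {}
    for event_id, values in rows.items():
        n_missing = sum(v is None for v in values)
        # drop events missing in more than na_rate of the samples
        if n_missing > len(values) * na_rate:
            continue
        # replace missing values with the mean of the observed values
        observed_mean = mean(v for v in values if v is not None)
        filled = [observed_mean if v is None else v for v in values]
        # drop events that barely vary across samples (population std. dev.)
        if pstdev(filled) < min_variance:
            continue
        kept[event_id] = filled
    return kept

toy_rows = {
    "event_a": [0.2, None, 0.4, 0.6, 0.8],     # 1 of 5 missing: kept, NA imputed
    "event_b": [None, None, None, 0.5, 0.7],   # 3 of 5 missing: dropped
    "event_c": [0.5, 0.5, 0.5, 0.5, 0.5],      # no variation: dropped
}
kept = qc_rows(toy_rows)
```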

In [None]:
[psichomics_norm_1]
input: ratios, group_by = 'all'
output: f'{cwd}/psichomics_raw_data_bedded.txt' 
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container
    library(psichomics)
    library(data.table)

    psi_data <- as.matrix(fread("${_input}"),rownames=1)
    psi_data = as.data.frame(psi_data)
  
    # Process PSI df into bed file for tensorQTL. (This part of code is modified from Ryan Yordanoff's work)
    parsed_events <- parseSplicingEvent(row.names(psi_data))
    
    # Create bedfile df and fill values with parsed values
    bed_file <- data.frame("chr"=parsed_events$chrom,"start"=parsed_events$start,"end"=parsed_events$end,"ID"=row.names(parsed_events),psi_data,check.names = FALSE)
    names(bed_file)[1] <- "#Chr"
    bed_file$'#Chr' <- sub("^", "chr", bed_file$'#Chr')
    row.names(bed_file) <- NULL   
  
    # Create BED file output
    write.table(x=bed_file, file = "${cwd}/psichomics_raw_data_bedded.txt", quote = FALSE, row.names = FALSE, sep = "\t")

In [None]:
[leafcutter_norm_3,psichomics_norm_2]
# maximal NA rate across samples for a candidate alternative splicing event to be kept (default 0.4, following the leafcutter default NA rate)
parameter: na_rate = 0.4
# minimal variation across samples for a candidate alternative splicing event to be kept (default 0.001, following the minimal variance suggested by psichomics); the code compares this threshold against the per-event standard deviation
parameter: min_variance = 0.001
output: f'{_input:n}.qqnorm.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
    import numpy as np
    import pandas as pd
    from sklearn import preprocessing

    from scipy.stats import norm
    from scipy.stats import rankdata

    def qqnorm(x):
        n=len(x)
        a=3.0/8.0 if n<=10 else 0.5
        return(norm.ppf( (rankdata(x)-a)/(n+1.0-2.0*a) ))

    raw_df = pd.read_csv(f'${_input}',sep = "\t")
    # copy so that imputed values can be written back without touching raw_df
    valRows = raw_df.iloc[:,4:].copy()
    headers = list(raw_df)

    drop_list = []
    na_limit = len(valRows.columns)*${na_rate}

    for index, row in valRows.iterrows():

        # If the ratio is missing for more than na_rate of the samples, drop the row
        if (row.isna().sum()) > na_limit:
            drop_list.append(index)
            continue
        # Set missing values to the mean of the observed values in the row;
        # iterrows() yields copies, so the filled row must be written back
        row = row.fillna(row.mean())
        valRows.loc[index] = row
        # drop introns whose standard deviation is smaller than the minimal value
        if np.std(row) < ${min_variance}:
            drop_list.append(index)

    # save the intron information and sample values for remaining introns/rows
    newtable = raw_df.drop(drop_list).iloc[:,0:4]
    valRows = valRows.drop(drop_list)

    # scale normalize
    valRows_matrix = []
    for c in (valRows.values.tolist()):
        c = preprocessing.scale(c)
        valRows_matrix.append(c)
    
    # qqnorms on the columns
    matrix = np.array(valRows_matrix)
    for i in range(len(matrix[0,:])):
        matrix[:,i] = qqnorm(matrix[:,i])
    normalized_table = pd.DataFrame(matrix)

    # reset the row index of the saved intron information so it matches the sample values
    newtable = newtable.reset_index(drop=True)
    # merge the two parts of table
    output = pd.concat([newtable, normalized_table], axis=1)
    output.columns = headers

    # write normalized table
    output.to_csv(f'${_input:n}.qqnorm.txt', sep="\t", index=None)