# Factor analysis using Bi-Cross validation

## Overview

This module uses an implementation of the method described in the following paper:
> Owen, Art & Wang, Jingshu. (2015). Bi-Cross-Validation for Factor Analysis. Statistical Science. 31. 10.1214/15-STS539. 

The software used is APEX:
> Corbin Quick, Li Guan, Zilin Li, Xihao Li, Rounak Dey, Yaowu Liu, Laura Scott, Xihong Lin. A versatile toolkit for molecular QTL mapping and meta-analysis at scale. bioRxiv 2020.12.18.423490; doi: https://doi.org/10.1101/2020.12.18.423490


## Cautions

- Note that the command options differ from those in the APEX website documentation. The commands on the documentation page do not work (last updated September 2021). The commands below were constructed and tested by our team based on our understanding of the program, without input from the APEX authors.


## Input

1. An indexed bed.gz file with the same format as [PEER factor analysis](https://cumc.github.io/xqtl-pipeline/code/data_preprocessing/covariate/PEER_factor.html).
2. A cov file with the same format as [PEER factor analysis](https://cumc.github.io/xqtl-pipeline/code/data_preprocessing/covariate/PEER_factor.html).
3. An indexed vcf.gz file.
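
As a quick sanity check of the first input, the sketch below reads only the header of a phenotype `bed.gz` file and returns the sample IDs. It assumes the first three columns are named `#chr`, `start`, and `end`, followed by one feature ID column, which is how the `[global]` section of this notebook reads the file. The helper name `sample_ids_from_bed` is hypothetical and is not part of the pipeline.

```python
import gzip


def sample_ids_from_bed(path):
    """Return the sample IDs found in the header of a (b)gzipped phenotype BED file.

    Assumption: columns 1-3 are '#chr', 'start', 'end', column 4 is the
    feature ID, and every remaining column is one sample.
    """
    # bgzip output is a valid multi-member gzip stream, so gzip.open can read it.
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
    if header[:3] != ["#chr", "start", "end"]:
        raise ValueError(f"Unexpected leading columns: {header[:3]}")
    if len(header) < 5:
        raise ValueError("Expected at least one sample column after the ID column")
    return header[4:]
```

Calling `len(sample_ids_from_bed("ALL.log2cpm.bed.chr12.mol_phe.bed.gz"))` would give the sample count that the workflow uses when `--N` is not set.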

## Output 

1. A cov.gz file with the same format as [PEER factor analysis](https://cumc.github.io/xqtl-pipeline/code/data_preprocessing/covariate/PEER_factor.html).

## Minimal working example
An MWE is uploaded to [Google Drive](https://drive.google.com/drive/folders/1yjTwoO0DYGi-J9ouMsh9fHKfDmsXJ_4I?usp=sharing).
The singularity image (sif) for running this MWE is uploaded to [Google Drive](https://drive.google.com/drive/folders/1mLOS3AVQM8yTaWtCbO8Q3xla98Nr5bZQ).

Both `ALL.log2cpm.bed.chr12.mol_phe.bed.gz` and `ALL.log2cpm.bed.chr12.mol_phe.bed.gz.tbi` are needed.

Because the MWE contains only 10 genes but 400+ samples, the N computed from the sample size would be far greater than the number of genes. Therefore, N is fixed at 3 in the MWE.
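
For reference, when `--N` is left at 0 the `[global]` section of this notebook picks the number of factors from the number of samples alone. The sketch below restates that rule as a standalone function. The name `default_num_factors` is hypothetical, and the thresholds are copied from the notebook code rather than from APEX.

```python
def default_num_factors(n_samples):
    """Number of factors chosen by sample size, mirroring the notebook's rule."""
    if n_samples < 150:
        return 15
    elif n_samples < 250:
        return 30
    elif n_samples < 350:
        return 45
    else:
        return 60
```

With the 400+ samples of the MWE this rule yields 60 factors, which is why N is set by hand when only 10 genes are present.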


In [None]:
sos run pipeline/BiCV_factor.ipynb BiCV \
   --cwd output \
   --phenoFile ALL.log2cpm.bed.chr12.mol_phe.bed.gz  \
   --container containers/apex.sif  \
   --N 3

## Command interface

In [1]:
sos run BiCV_factor.ipynb -h

usage: sos run BiCV_factor.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  fake_vcf
  BiCV

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files. MUST BE FULL
                        PATH
  --phenoFile VAL (as path, required)
                        The molecular phenotype matrix
  --covFile . (as path)
                        The covariate file
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 8 (as int)
                        Number of thr

In [None]:
[global]
# The output directory for generated files. MUST BE FULL PATH
parameter: cwd = path("output")
# The molecular phenotype matrix
parameter: phenoFile = path
# The covariate file
parameter: covFile = path(".")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
parameter: name = ""
# Number of factors; if not specified, it is chosen from the sample size following the GTEx convention
parameter: N = 0
# The number of iterations
parameter: iteration = 10

import pandas as pd
# Pass sep by keyword: recent pandas versions accept only the file path positionally
data = pd.read_csv(phenoFile, sep="\t", index_col=3).drop(["#chr", "start", "end"], axis=1)
if N == 0:
    if len(data.columns) < 150:
        N = 15
    elif len(data.columns) < 250:
        N = 30
    elif len(data.columns) < 350:
        N = 45
    else:
        N = 60

In [None]:
[fake_vcf]
# `time` provides the date stamp written into the fake VCF header
import time
input: phenoFile
output: f'{cwd:a}/{_input:bn}.fake.vcf.gz'
#task: trunk_workers = 1,trunk_size = job_size , walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: container=container, expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    library("dplyr")
    library("readr")
    ## Add fake header
    cat(paste("##fileformat=VCFv4.2\n##fileDate=$[time.strftime("%Y%m%d",time.localtime())]\n##source=FAKE\n"), file=$[_output:nr], append=FALSE)
    ## Add colnames based on bed
    pheno = read_delim("$[_input]", delim = "\t",n_max = 1)
    colnames(pheno)[1:3] = c("#CHROM","POS","ID") 
    pheno = cbind(pheno[1:3]%>%mutate(REF = "A", ALT = "C", QUAL = ".",FILTER = ".", INFO = "PR", FORMAT = "GT"),pheno[,5:ncol(pheno)])
    pheno%>%write_delim($[_output:nr],delim = "\t",col_names = T, append = T)
bash: container=container, expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    bgzip -f $[_output:n]
    tabix -p vcf $[_output]
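
The `fake_vcf` step above appears to build a placeholder VCF from the first row of the phenotype file, so that a VCF carrying the same sample columns is available for the `--vcf` option. The Python sketch below mirrors the R logic to make the resulting layout explicit. It is an illustration only: `fake_vcf_lines` is a hypothetical name, and the input layout (three coordinate columns, one ID column, then samples) is assumed from the R code.

```python
import time


def fake_vcf_lines(bed_header, first_row, file_date=None):
    """Return the lines of a one-record placeholder VCF built from a BED header and row."""
    if file_date is None:
        file_date = time.strftime("%Y%m%d", time.localtime())
    # Meta lines, as written by the cat() call in the R step.
    meta = ["##fileformat=VCFv4.2", f"##fileDate={file_date}", "##source=FAKE"]
    # The first three BED columns are renamed; the fourth (feature ID) is dropped.
    fixed_names = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
    columns = fixed_names + list(bed_header[4:])
    # Constant placeholder values follow the R mutate() call; the sample fields
    # simply carry over the values of the first BED row.
    record = list(first_row[:3]) + ["A", "C", ".", ".", "PR", "GT"] + list(first_row[4:])
    return meta + ["\t".join(columns), "\t".join(str(v) for v in record)]
```

The returned lines correspond to the uncompressed file that `bgzip` and `tabix` then compress and index.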

In [3]:
[BiCV]
input:  output_from("fake_vcf"),phenoFile
output: f'{cwd:a}/{_input[1]:bnn}.BiCV.cov.gz'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container, expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    apex factor \
        --out $[_output[0]:nn] \
        --iter $[iteration] \
        --factors $[N] \
        --bed $[_input[1]] \
        --vcf $[_input[0]] \
        --threads $[numThreads]  $[ f'--cov {covFile}' if covFile.is_file() else '']
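
To make the conditional `--cov` handling in the `BiCV` step easier to follow, the sketch below assembles the same argument list in Python. The flags are exactly those used in this notebook and have not been checked against the APEX documentation. The function name `build_apex_factor_command` and its defaults are hypothetical.

```python
from pathlib import Path


def build_apex_factor_command(out_prefix, bed, vcf, n_factors,
                              iterations=10, threads=8, cov_file=None):
    """Assemble the `apex factor` argument list used by the BiCV step."""
    cmd = [
        "apex", "factor",
        "--out", str(out_prefix),
        "--iter", str(iterations),
        "--factors", str(n_factors),
        "--bed", str(bed),
        "--vcf", str(vcf),
        "--threads", str(threads),
    ]
    # --cov is added only when the path points to an existing regular file.
    if cov_file is not None and Path(cov_file).is_file():
        cmd += ["--cov", str(cov_file)]
    return cmd
```

Because the default `covFile` is `.`, which is a directory rather than a file, the default run omits `--cov` entirely.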