This pipeline module contains code to process summary statistics from conventional QTL association scans into standard formats for public distribution. It also exports multiple QTL studies to formats that data integration methods can easily query and analyze.

## Overview of design

### Individual xQTL studies

1. Reorganize the QTL marginal statistics into standard formats (see section `Column name standardization` for details).
2. Report cis-QTL and trans-QTL separately. Additionally report filtered results, e.g. QTLs that survive multiple-testing correction.
3. Our column conventions are based on studies such as GTEx and the eQTL Catalogue. Our design is "modular" in the sense that we do not provide information that can be trivially annotated afterwards, such as **rsID, gene symbol, gene biotype, gene start and end positions**, or information that can be inferred from other columns, such as the type of allele (SNP or INDEL).
4. We use the GRCh38 reference and alternative alleles, and the effect allele is adjusted, as necessary, to be the alternative allele.
5. Summary statistics will be in conventional TSV format, which can be converted to [GWAS-VCF format](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02248-0) as needed.

### Multiple xQTL integration

- We will **NOT** standardize integrative results. Instead we will distribute the original outputs from the data integration methods we chose. 
- For meta-analysis we'll rely on output from [METASOFT](http://genetics.cs.ucla.edu/meta/) using `effect_id` as SNP ID, created from `variant`, `molecular_trait_id` and `molecular_trait_object_id` as needed.
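
As a minimal sketch of how such an `effect_id` could be assembled, the hypothetical helper below joins the identifiers that are available. The function name, the `|` separator and the field order are assumptions made for illustration; they are not taken from METASOFT or from this pipeline.

```python
def make_effect_id(variant, molecular_trait_id=None,
                   molecular_trait_object_id=None, sep="|"):
    """Join the identifiers that are present into a single SNP ID (hypothetical helper)."""
    parts = [variant, molecular_trait_id, molecular_trait_object_id]
    # Skip identifiers that were not supplied (None or empty string)
    return sep.join(p for p in parts if p)


# Variant plus both trait-level identifiers
print(make_effect_id("chr19_226776_C_T", "ENST00000356937", "ENSG00000008128"))
# Variant only, e.g. when no molecular trait is involved
print(make_effect_id("chr19_226776_C_T"))
```

Any separator works as long as it cannot occur inside the individual identifiers.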

To build an internal database for multiple xQTL data-sets, we:

1. Include variants that are present in at least one xQTL study.
2. Unify allele strand and frequency flips to the GRCh38 reference.

## Column name standardization

The header of the actual output sumstat depends on how we configure it (see section `Input` for details). However, all outputs will have the columns `chromosome, position, ref, alt, variant_id, beta, se, pvalue`.
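
A quick way to confirm that a file carries this minimal set of columns is sketched below. It only reads the header line of a tab-delimited file; the helper name `missing_columns` is hypothetical and not part of the pipeline.

```python
import csv
import io

REQUIRED = ["chromosome", "position", "ref", "alt",
            "variant_id", "beta", "se", "pvalue"]


def missing_columns(handle, required=REQUIRED):
    """Return the required column names that are absent from a TSV header."""
    header = next(csv.reader(handle, delimiter="\t"))
    return [c for c in required if c not in header]


# In-memory stand-in for an output file with one extra column (maf)
demo = io.StringIO("chromosome\tposition\tref\talt\tvariant_id\tbeta\tse\tpvalue\tmaf\n")
print(missing_columns(demo))
```

For a gzip-compressed output, the same function can be given a handle from `gzip.open(path, "rt")`.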

### Software (input) headers

For example, when the input sumstat is from TensorQTL, the column specification is:

- GENE: Molecular trait identifier (gene).
- CHR: Variant chromosome.
- POS: Variant chromosomal position (basepairs).
- A0: Variant reference allele (A, C, T, or G).
- A1: Variant alternate allele.
- TSS_D: Distance of the SNP to the gene transcription start site (TSS).
- AF: The allele frequency of this SNP.
- MA_SAMPLES: Number of samples carrying the minor allele
- MA_COUNT: Total number of minor alleles across individuals
- P: Nominal P-value from linear regression
- STAT: Slope of the linear regression
- SE: Standard error of beta

When the input sumstat is from APEX, the column specification is:

- GENE: Molecular trait identifier (gene).
- CHR: Variant chromosome.
- POS: Variant chromosomal position (basepairs).
- A0: Variant reference allele (A, C, T, or G).
- A1: Variant alternate allele.
- P: Nominal P-value from linear regression
- STAT: Slope of the linear regression
- SE: Standard error of beta
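
The workflow code further down maps these software headers onto the standardized names. A minimal sketch of that renaming on a single record is shown here; the example row is made up, and columns without a standardized counterpart are simply passed through.

```python
# Software header -> standardized name, mirroring the old_key/new_key lists in the workflow
NAME_MAP = {
    "GENE": "molecular_trait_id", "CHR": "chromosome", "POS": "position",
    "A0": "ref", "A1": "alt", "STAT": "beta", "SE": "se", "P": "pvalue",
}


def standardize_row(row, name_map=NAME_MAP):
    """Rename known keys and keep unknown ones (e.g. AF) unchanged."""
    return {name_map.get(key, key): value for key, value in row.items()}


# Made-up record for illustration only
row = {"GENE": "geneA", "CHR": "1", "POS": 10000, "A0": "T", "A1": "TC",
       "AF": 0.12, "P": 0.03, "STAT": -0.4, "SE": 0.18}
print(standardize_row(row))
```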

### Our effect level summary

Our proposed xQTL summary statistics fields should include (cf. [our xQTL format draft V3, Jan 2021](https://www.niagads.org/adsp/content/xqtl-fileformats-110921v3sharedxlsx), [eQTL catalog](https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/tabix/Columns.md), [GTEx](https://www.gtexportal.org/home/datasets), [metaBrain](https://www.metabrain.nl/)):

* **variant** - The variant ID (chromosome_position_ref_alt) e.g. chr19_226776_C_T. Based on GRCh38 coordinates and reference genome, with 'chr' prefix added to the chromosome number.
* **chromosome** - GRCh38 chromosome name of the variant (e.g. 1,2,3 ...,X).
* **position** - GRCh38 position of the variant.
* **ref** - GRCh38 reference allele.
* **alt** - GRCh38 alternative allele (also the effect allele).
* **imputation_quality** - Optional imputation quality score from the imputation software, can be replaced with NA if not available.
* **molecular_trait_object_id** - For phenotypes with multiple correlated alternatives (multiple alternative transcripts or exons within a gene, multiple alternative promoters in txrevise, multiple alternative introns in Leafcutter), this defines the level at which the phenotypes were aggregated. Permutation p-values are calculated across this set of alternatives.
* **molecular_trait_id** - ID of the molecular trait used for QTL mapping. Depending on the quantification method used, this can be either a gene id, exon id, transcript id or a txrevise promoter, splicing or 3'end event id. Examples: ENST00000356937, ENSG00000008128.  
* **maf** - Minor allele frequency within a QTL mapping context (e.g. cell type or tissues within a study).
* **beta** - Regression coefficient from the linear model.
* **se** - Standard error of the beta.
* **pvalue** - Nominal p-value of association between the variant and the molecular trait.
* **n** - Total number of samples without missing data.
* **ac** - Count of the alternative allele. 
* **ma_samples** - Number of samples carrying at least one copy of the minor allele.
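
The `variant` and `maf` fields can be derived from other columns. The sketch below builds the ID in the `chromosome_position_ref_alt` layout described above and derives a minor allele frequency from an alternative-allele frequency. Both helper names are hypothetical, and the assumption that the input frequency refers to the alternative allele is ours.

```python
def make_variant_id(chromosome, position, ref, alt):
    """Build an ID such as chr19_226776_C_T, adding the 'chr' prefix when missing."""
    chrom = str(chromosome)
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    return f"{chrom}_{position}_{ref}_{alt}"


def maf_from_alt_af(alt_af):
    """Minor allele frequency given the frequency of the alternative allele."""
    return min(alt_af, 1.0 - alt_af)


print(make_variant_id(19, 226776, "C", "T"))
print(maf_from_alt_af(0.8))
```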

### Trait level QTL summary (multiple-testing corrected)

* **molecular_trait_object_id** 
* **molecular_trait_id** 
* **n_traits** - The number of molecular traits over which permutation p-values were calculated (e.g. the number of transcripts per gene). Note that the permutations are performed across all molecular traits within the same molecular trait object (e.g. all transcripts of a gene) and the results are reported for the most significant variant and molecular trait pair.
* **n_variants** - number of genetic variants tested within the cis region of the molecular trait.
* **variant** 
* **chromosome** - GRCh38 chromosome name of the variant (e.g. 1,2,3 ...,X).
* **position** - GRCh38 position of the variant.
* **ref** - GRCh38 reference allele.
* **alt** - GRCh38 alternative allele (also the effect allele).
* **p_perm** - Empirical p-value calculated from 1000 permutations.
* **p_beta** - Estimated empirical p-value based on the beta distribution. This is the column that you want to use for filtering the results. See the FastQTL [paper](http://dx.doi.org/10.1093/bioinformatics/btv722) for more details. 
* **qvalue** - FDR based on Storey's q-value.
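
To make the `p_perm` column concrete, here is a minimal sketch of an empirical p-value computed from permutation statistics, using the common add-one estimator. Whether the upstream QTL software uses exactly this estimator is an assumption, and the simulated null statistics are for illustration only.

```python
import random


def empirical_pvalue(observed, permuted):
    """Add-one empirical p-value: (1 + #{permuted >= observed}) / (1 + n_permutations)."""
    exceed = sum(1 for stat in permuted if stat >= observed)
    return (exceed + 1) / (len(permuted) + 1)


# 1000 simulated null statistics (absolute values of standard normal draws)
rng = random.Random(0)
null_stats = [abs(rng.gauss(0, 1)) for _ in range(1000)]
print(empirical_pvalue(2.5, null_stats))
```

With 1000 permutations the smallest value this estimator can return is 1/1001, which is one reason the beta-approximated `p_beta` is preferred for filtering.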

Other summary:

- Quantiles of molecular phenotypes

## Some technical notes

1. If there are duplicated INDELs in the summary statistics, they will be removed. For example, suppose two variants sit at position 10000 on chr1: one has `A0` = `T` and `A1` = `TC`, whereas the other has `A0` = `TC` and `A1` = `T`. Both of them will be removed. See [more about INDEL issues](https://github.com/statgenetics/UKBB_GWAS_dev/issues/81#issuecomment-1015556800). For SNPs, `A0` and `A1` can be easily standardized to ref/alt in the GRCh38 reference genome.
2. If duplicated `chr:pos` (GWAS) or `gene:chr:pos` (TWAS) exist, run a recursive match for each pair of them between the two summary statistic files (`query`, each of the inputs, and `subject`, the target file).
3. Under the same `chr:pos` or `gene:chr:pos`, the variants' `A0` and `A1` are matched by the exact, flip, reverse, or flip+reverse models. The variant in the two files is considered matched only if exactly one of these models is `True`. If the match is by flip or flip+reverse, the sign of the `query`'s `STAT` is inverted, and the `query`'s `A0` and `A1` are set to the `subject`'s `A0` and `A1`. **FIXME: should we standardize it to GRCh38 first?**
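
The four matching models can be sketched as follows. This is an illustration only and not the `cugg` implementation: it assumes that "flip" means swapping `A0` and `A1`, that "reverse" means taking the opposite strand, and that a variant matching more than one model (as A/T and C/G SNPs do) is treated as unresolved.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_strand(allele):
    """Reverse-complement an allele (plain complement for single bases)."""
    return allele.translate(COMPLEMENT)[::-1]


def match_alleles(query, subject):
    """Return (model, sign) for (A0, A1) tuples, or (None, None) if not exactly one model fits."""
    q0, q1 = query
    r0, r1 = reverse_strand(q0), reverse_strand(q1)
    models = {
        "exact": (q0, q1) == subject,
        "flip": (q1, q0) == subject,
        "reverse": (r0, r1) == subject,
        "flip+reverse": (r1, r0) == subject,
    }
    hits = [name for name, ok in models.items() if ok]
    if len(hits) != 1:
        return None, None  # no match, or ambiguous such as A/T and C/G
    model = hits[0]
    # The effect sign is inverted whenever A0 and A1 were swapped
    return model, (-1 if "flip" in model else 1)


for query in [("A", "G"), ("G", "A"), ("T", "C"), ("C", "T"), ("A", "T")]:
    print(query, match_alleles(query, ("A", "G")))
```

Multiplying the query's `STAT` by the returned sign gives the effect relative to the subject's `A1`.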

## Pre-requisites

Make sure you install the prerequisites before running this notebook:

```
pip install cugg
```

## Input

- `--cwd`, the path of the working directory.
- `--yml_list`, the path to a list of yaml files.
- `--keep-ambiguous`, boolean, default False. If `--keep-ambiguous` is given, keep ambiguous alleles that cannot be resolved by flip or reverse, such as A/T or C/G. Otherwise, remove them.
- `--intersect`, boolean, default False. If `--intersect` is given, output only the SNPs shared by all input files.
- `--TARGET_list`, a path to a list of reference files with the columns CHR, POS, REF, ALT that represent the correct reference alleles. If this reference is not available, it can be generated by the `TARGET_generation` step of this workflow.

- TARGET
   - The target file is a reference summary statistic file, or a file with at least the columns relevant to the variant ID. When provided with standard `chr, pos, ref, alt` based on GRCh38, it can be used to standardize the REF/ALT alleles.
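
The effect of `--keep-ambiguous` can be pictured with the sketch below, which drops strand-ambiguous SNPs unless asked to keep them. The helper names and the tuple layout `(chr, pos, ref, alt)` are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(ref, alt):
    """True for single-base pairs that look identical on both strands (A/T, C/G)."""
    return (len(ref) == 1 and len(alt) == 1
            and COMPLEMENT.get(ref.upper()) == alt.upper())


def filter_variants(variants, keep_ambiguous=False):
    """Drop strand-ambiguous SNPs unless keep_ambiguous is set."""
    if keep_ambiguous:
        return list(variants)
    return [v for v in variants if not is_ambiguous(v[2], v[3])]


variants = [("1", 100, "A", "T"), ("1", 200, "A", "G"), ("1", 300, "C", "G")]
print(filter_variants(variants))
print(filter_variants(variants, keep_ambiguous=True))
```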

### The minimal format of the input yaml file 

For GWAS summary statistics, 

```
INPUT:
  - ./data/testflip/*.gz:
        build: GRCh38
        variant: chromosome, position, ref, alt
        chromosome: CHR
        position: POS
        ref: A0
        alt: A1
        beta: BETA
        se: SE
        pvalue: P
  - ./data/testflip/flip/snps500_flip.regenie.snp_stats.gz:
        build: GRCh38
        variant: chromosome, position, ref, alt
        chromosome: CHR
        position: POS
        ref: A0
        alt: A1
        beta: BETA
        se: SE
        pvalue: P
        
OUTPUT: data/testflip/output/
```

For xQTL summary statistics, `molecular_trait_id` is required because a variant can be associated with multiple molecular traits.

```
INPUT:
  - data/twas/*.txt:
        build: GRCh38
        variant: chromosome, position, ref, alt
        chromosome: CHR
        position: POS
        ref: A0
        alt: A1
        molecular_trait_id: GENE
        beta: BETA
        se: SE
        pvalue: P 

OUTPUT: ../data/twas/output/
```
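
The reason a trait identifier is needed can be seen in the following sketch: two made-up records share the same `chr:pos` but belong to different genes, so they only become unique once the gene is part of the key. The helper `duplicated_keys` is hypothetical.

```python
from collections import Counter


def duplicated_keys(rows, key_fields):
    """Return the key tuples that occur more than once."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Made-up records: one variant tested against two genes
rows = [
    {"GENE": "geneA", "CHR": "1", "POS": 10000},
    {"GENE": "geneB", "CHR": "1", "POS": 10000},
]
print(duplicated_keys(rows, ["CHR", "POS"]))          # collides without the trait ID
print(duplicated_keys(rows, ["GENE", "CHR", "POS"]))  # unique once the trait ID is included
```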

The input yaml file has the following parts.
- INPUT
   - A list of yml files, as output by yml_generator; each yml file documents a set of inputs:
       - The input summary statistic files, with the column names paired to keys as described below.
       - The input files can come from multiple directories and be in different formats. The input paths must follow Unix shell rules. The format pairs the required keys with the column names of the file. If a key is not provided, the column names of the input file are taken as the default keys.
       - The input summary statistic file cannot have duplicated chr:pos.
       - The input summary statistic file cannot have `#` in its header.
       - `variant` in the yml is the rule for generating a unique identifier for each SNP. The variant ID shall be a combination of other columns such as chrom, position, ref, alt, build, and is not taken from existing ID columns in the original file.

- OUTPUT
   - the path of an output directory for new summary statistic files
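
The `variant` rule and the header restriction above can be sketched as follows. The rule string and column mapping are the ones from the yaml examples; the `_` separator and both helper names are assumptions, and the real parsing is done by `cugg`.

```python
def build_variant_id(rule, column_map, row, sep="_"):
    """Apply a rule such as "chromosome, position, ref, alt" to one record.

    column_map maps each yaml key to the column name used in the input file.
    """
    keys = [key.strip() for key in rule.split(",")]
    return sep.join(str(row[column_map[key]]) for key in keys)


def header_problems(header):
    """Report header issues that the input rules forbid."""
    problems = []
    if any("#" in name for name in header):
        problems.append("header contains '#'")
    return problems


column_map = {"chromosome": "CHR", "position": "POS", "ref": "A0", "alt": "A1"}
row = {"CHR": "1", "POS": 10000, "A0": "T", "A1": "TC"}
print(build_variant_id("chromosome, position, ref, alt", column_map, row))
print(header_problems(["#CHR", "POS", "A0", "A1"]))
```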

## Output

New summary statistic files with the SNPs common to all input files. The sign of the statistics has been corrected to make it consistent across data sets.
   - For each input sumstat file, a standardized version is generated.
   - The generated sumstat files have standardized header names. The minimal set of headers is "chromosome, position, ref, alt, variant_id, beta, se, pvalue".
   - The generated sumstat files are in gz format.
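
Restricting every file to the shared SNPs (the `--intersect` behaviour) amounts to a set intersection over variant IDs. A minimal sketch with made-up file names and effect sizes:

```python
def common_variants(tables):
    """tables maps file name -> {variant_id: beta}; return IDs present in every file."""
    id_sets = [set(table) for table in tables.values()]
    return set.intersection(*id_sets) if id_sets else set()


# Made-up inputs for illustration only
tables = {
    "study1.gz": {"chr1_100_A_G": 0.2, "chr1_200_C_T": -0.1},
    "study2.gz": {"chr1_100_A_G": 0.3, "chr1_300_G_A": 0.5},
}
shared = common_variants(tables)
filtered = {name: {v: b for v, b in table.items() if v in shared}
            for name, table in tables.items()}
print(filtered)
```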

## Memory usage
Merging two sumstat files with ~85,000 rows and a size of ~5 MB requires 1 GB of memory.

Merging two sumstat files with ~2,000,000 rows and a size of ~1 GB requires at least 50 GB of memory.

## MWE Example command

### Target generation

```
sos run  pipeline/summary_stats_standardizer.ipynb   TARGET_generation  \
      --sumstat-list output/data_intergration/TensorQTL/qced_sumstat_list.txt    \
      --yml-list output/data_intergration/TensorQTL/yml_list.txt    \
      --fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
      --cwd output/data_intergration/TensorQTL  -J 2 -c csg.yml -q csg --mem 50G --walltime 48h &
```

### sumstat_standardization

```
sos run  pipeline/summary_stats_standardizer.ipynb   sumstat_standardization  \
      --sumstat-list output/data_intergration/TensorQTL/qced_sumstat_list.txt    \
      --yml-list output/data_intergration/TensorQTL/yml_list.txt    \
      --TARGET_list output/data_intergration/TensorQTL/TARGET.ref.list \
      --cwd output/data_intergration/TensorQTL  -J 2 -c csg.yml -q csg --mem 50G --walltime 48h &
```
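
The `--yml-list` file read by these commands is a tab-delimited table with the columns `#chr` and `dir` (see the comment in the workflow code). A sketch of writing such a list is given below; the yml paths are placeholders, not files shipped with the pipeline.

```python
import csv
import io


def write_yml_list(handle, entries):
    """Write (chromosome, yml path) pairs as a TSV with the header '#chr<TAB>dir'."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["#chr", "dir"])
    for chrom, yml_path in entries:
        writer.writerow([chrom, yml_path])


# Placeholder paths for illustration only
buf = io.StringIO()
write_yml_list(buf, [(1, "output/yml/chr1.yml"), (2, "output/yml/chr2.yml")])
print(buf.getvalue())
```

To write to disk instead, pass a handle opened with `open(path, "w", newline="")`.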

output/data_intergration/TensorQTL/MWE.3.yml.TARGET.ref.list

In [None]:
sos run  pipeline/summary_stats_standardizer.ipynb   sumstat_to_vcf  \
      --sumstat-list  /mnt/vast/hpc/csg/ROSMAP_methy_QTL/data_intergration/TensorQTL/qced_sumstat_list.txt   \
      --cwd /mnt/vast/hpc/csg/ROSMAP_methy_QTL/data_intergration/TensorQTL/  -J 23 -c csg.yml -q csg2 --mem 50G --walltime 48h &

In [None]:
[global]
import pandas as pd 
# Work directory where output will be saved to
parameter: cwd = path("output")

#if add --keep-ambiguous parameter, keep ambiguous alleles which can not be decided from flip or reverse, such as A/T or C/G. Otherwise, remove them.
parameter: keep_ambiguous = False
# if add --intersect parameter, output intersect SNPs in all input files.
parameter: intersect = False
# Containers that contains the necessary packages
parameter: container = ""
parameter: numThreads = 1
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Walltime 
parameter: walltime = '5h'
parameter: mem = '3G'
# Path to a tab-delimited list of sumstat files, with a `#chr` column followed by one column per study
parameter: sumstat_list = path
sumstat_path = pd.read_csv(sumstat_list,sep = "\t").drop(columns="#chr").values.tolist()
name = pd.read_csv(sumstat_list,sep = "\t").drop(columns="#chr").columns.values.tolist()
## Whether to rename the Chr name.
parameter: remame = False
import time
# Two-column map (1 -> chr1, ..., MT -> chrMT) consumed by `bcftools annotate --rename-chrs`
pd.DataFrame({"A" : list(range(1,23)) + ["X","Y","MT"],"X" : [ f'chr{x}' for x in  list(range(1,23)) + ["X","Y","MT"]]}).to_csv(f'{cwd}/chr_name', sep = "\t", header = False, index = False)

## Workflow codes
The first section generates a stand-alone target file that can be converted to VCF and then standardized against the reference FASTA. It includes three steps: take the union of all SNPs without allele flipping, create a pseudo-VCF file, and use bcftools to standardize the result.
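
The first of these steps, the union without allele flipping, can be sketched as an order-preserving de-duplication of `(CHR, POS, A0, A1)` tuples. The function name and the toy records are hypothetical; the workflow itself does this with pandas.

```python
def union_snps(*studies):
    """Concatenate studies and keep the first occurrence of each (CHR, POS, A0, A1)."""
    seen = set()
    merged = []
    for study in studies:
        for snp in study:
            if snp not in seen:
                seen.add(snp)
                merged.append(snp)
    return merged


study1 = [("1", 100, "A", "G"), ("1", 200, "C", "T")]
study2 = [("1", 100, "A", "G"), ("1", 100, "G", "A")]
print(union_snps(study1, study2))
```

Because no flipping is applied at this stage, `A/G` and `G/A` at the same position are kept as two separate records.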

In [None]:
[TARGET_generation_1]
## path to a list of yml file , with columns #chr and dir
parameter: yml_list = path
import pandas as pd
yml_path = pd.read_csv(yml_list,sep = "\t").values.tolist()
chr_inv = [x[0] for x in yml_path]
file_inv = [x[1] for x in yml_path]
input: file_inv , group_by = 1, group_with = "chr_inv"
output: f'{cwd}/{_input:bn}.{_chr_inv}.all_snp.vcf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: expand = '${ }', stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container = container
    import os
    import pandas as pd
    from cugg.sumstat import read_sumstat
    from cugg.sumstat import ss_2_vcf
    from cugg.utils import *
    yml = load_yaml("${_input}")
    input_dict = parse_input(yml['INPUT'])
    ## Remap the YML field name
    new_key = ["molecular_trait_id","chromosome", "position", "ref" ,"alt","variant"]
    old_key = ["GENE" ,  "CHR" ,  "POS" ,   "A0" ,   "A1","SNP"]

    for x in input_dict.values():
        for i,j in zip(old_key,new_key):
            # Optional keys (e.g. molecular_trait_id for GWAS input) may be absent
            if j in x:
                x[i] = x.pop(j)
            x["ID"] = x["ID"].replace(j,i)

    def ss_2_vcf(ss_df,name = "name"):
        ## Geno field
        df = pd.DataFrame()
        if "SNP" not in ss_df.columns:
            ss_df['SNP'] = 'chr'+ss_df.CHR.astype(str).str.strip("chr") + ':' + ss_df.POS.astype(str) + '_' + ss_df.A0.astype(str) + '_' + ss_df.A1.astype(str)
        df[['#CHROM', 'POS', 'ID', 'REF', 'ALT']] = ss_df[['CHR', 'POS', 'SNP', 'A0', 'A1']]
        ## Info field(Empty)
        df['QUAL'] = "."
        df['FILTER'] = "PASS"
        df['INFO'] = "."
        fix_header = ["SNP","A1","A0","POS","CHR","STAT","SE","P"]
        header_list = []
        if "GENE" in ss_df.columns:
            df['ID'] = ss_df['GENE'] + ":" + ss_df['SNP']
            df['INFO'] = "GENE=" + ss_df["GENE"]
            fix_header = ["GENE","SNP","A1","A0","POS","CHR","STAT","SE","P"]
            header_list = ['##INFO=<ID=GENE,Number=1,Type=String,Description="The name of genes">']
        ### Fix headers
        import time
        header = '##fileformat=VCFv4.2\n' + \
        '##FILTER=<ID=PASS,Description="All filters passed">\n' + \
        f'##fileDate={time.strftime("%Y%m%d",time.localtime())}\n'+ \
        '##FORMAT=<ID=STAT,Number=1,Type=Float,Description="Effect size estimate relative to the alternative allele">\n' + \
        '##FORMAT=<ID=SE,Number=1,Type=Float,Description="Standard error of effect size estimate">\n' + \
        '##FORMAT=<ID=P,Number=1,Type=Float,Description="The Pvalue corresponding to ES">' 
        ### Customized Field headers
        for x in ss_df.columns:
            if x not in fix_header:
                Prefix = f'##FORMAT=<ID={x},Number=1,Type='
                Type = str(type(ss_df[x].iloc[0])).replace("<class \'","").replace("'>","").replace("numpy.","").replace("64","").capitalize().replace("Int","Integer")
                Surfix = f',Description="Customized Field {x}">'
                header_list.append(Prefix+Type+Surfix)
        ## format and sample field
        df['FORMAT'] = ":".join(["STAT","SE","P"]  + ss_df.drop(fix_header,axis = 1).columns.values.tolist())
        df[f'{name}'] = ss_df.drop( ["SNP","A1","A0","POS","CHR"],axis = 1).astype(str).apply(":".join,axis = 1)
        ## Rearrangment
        df = df[['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO','FORMAT',f'{name}']]
        df = df.sort_values(['#CHROM', 'POS'])
        # Add headers
        # Join with newlines so the last fixed line and the first custom line do not run together
        header = "\n".join([header] + header_list) + "\n"
        return df,header

    ## Verify data uniqueness
    lst_sumstats_file = [ os.path.basename(i) for i in input_dict.keys()]
    if len(set(lst_sumstats_file))<len(lst_sumstats_file):
        raise Exception("There are duplicated names in {}".format(lst_sumstats_file))
    #read all sumstats
    print(input_dict)
    lst_sumstats = {os.path.basename(i):read_sumstat(i,j,) for i,j in input_dict.items()}
    ## Retaining only chrom/pos/ref/alt, and dropping the duplicates (drop twice to reduce mem usage)
    union_snp = pd.concat([x[["CHR" ,  "POS" ,   "A0" ,   "A1","SNP"]].drop_duplicates() for x in lst_sumstats.values() ]).drop_duplicates() 
    ## Create fake header
    union_snp[["STAT","SE","P"]] = 1
    sumstats,header = ss_2_vcf(union_snp,"PseudoVCF")
    with open(${_output:r}, 'w') as f:
        f.write(header)
    sumstats.to_csv(${_output:r}, sep = "\t", header = True, index = False,mode = "a")

In [None]:
[TARGET_generation_2]
## The reference fasta
parameter: fasta = path
output: f'{_input:nn}.TARGET.ref'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand = '${ }', stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container = container
    echo -e "CHR\tPOS\tA0\tA1" > ${_output}
    bgzip -f ${_input}
    tabix -p vcf -f  ${_input}.gz
    ## our fasta required chr* as chromosome name format
    bcftools annotate --rename-chrs ${cwd}/chr_name ${_input}.gz -Oz | \
    bcftools norm  -N --check-ref ws -f ${fasta} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' >> ${_output}

In [None]:
[TARGET_generation_3]
input: group_by = "all"
output: f'{cwd}/TARGET.ref.list'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: expand = '${ }', stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container = container
    import pandas as pd
    target_path = [${_input:r,}]
    chrom = [x.split(".")[-3].replace("chr","") for x in target_path ]
    pd.DataFrame({"#chr": chrom , "TARGET" : target_path }).to_csv("${_output}", sep = "\t", index = False)

In [113]:
[sumstat_standardization]
## path to a list of yml file , with columns #chr and dir
parameter: yml_list = path
import pandas as pd
yml_path = pd.read_csv(yml_list,sep = "\t").values.tolist()
depends: Py_Module('cugg')
parameter: TARGET_list = path
TARGET_path = pd.read_csv(TARGET_list,sep = "\t")
yml_path = pd.read_csv(yml_list,sep = "\t").merge(TARGET_path, on = "#chr").values.tolist()
file_inv = [x[1] for x in yml_path]
TARGET_inv = [x[2] for x in yml_path]
input: file_inv , group_by = 1, group_with = "TARGET_inv"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: expand = '${ }', stderr = f'{_input}.stderr', stdout = f'{_input}.stdout',container = container
    import os
    import pandas as pd
    from cugg.sumstat import read_sumstat
    from cugg.utils import *
    
    yml = "${_input}"
    keep_ambiguous = ${keep_ambiguous}
    intersect = ${intersect}
    print(yml, keep_ambiguous,intersect)
    #parse yaml
    yml = load_yaml(yml)
    input_dict = parse_input(yml['INPUT'])
    ## Remap the YML field name
    new_key = ["molecular_trait_id","chromosome", "position", "ref", "alt","beta","se","pvalue","variant"]
    old_key = ["GENE" ,  "CHR" ,  "POS" ,   "A0" ,   "A1" ,"STAT","SE","P","SNP"]
    name_map = {old_key[i]: new_key[i] for i in range(len(old_key))}
    name_map_rev = {new_key[i]: old_key[i] for i in range(len(old_key))}
    for x in input_dict.values():
        for k, v in list(x.items()):
            x[name_map_rev.get(k, k)] = x.pop(k)
        for i,j in zip(old_key,new_key):
            x["ID"] = x["ID"].replace(j,i)
    target_dict = "${_TARGET_inv}"
    output_path = yml['OUTPUT'][0]
    lst_sumstats_file = [ os.path.basename(i) for i in input_dict.keys()]
    print('Total number of sumstats: ',len(lst_sumstats_file))
    if len(set(lst_sumstats_file))<len(lst_sumstats_file):
        raise Exception("There are duplicated names in {}".format(lst_sumstats_file))
    #read all sumstats
    print(input_dict)
    lst_sumstats = {os.path.basename(i):read_sumstat(i,j,) for i,j in input_dict.items()}
    nqs = []
    #Readin the reference target file
    subject = check_indels(read_sumstat(target_dict,None,True)[["CHR","POS","SNP","A0","A1"]])
    if "*" in subject.A0.values: 
        raise ValueError(f'illegal character "*" is in the REF column of the TARGET, please check the TARGET file with a reference')
    for query in lst_sumstats.values():
        #check duplicated indels and remove them.
        query = check_indels(query)
        # Set the snp column to be the second column to satisify the requirement of compare_snps() function
        column =  query.pop("SNP")
        query.insert(2,"SNP", column )
        #under the same chr:pos or gene:chr:pos. match A0 and A1 by exact, flip, reverse, or flip+reverse.
        #if duplicated chr_pos or gene_chr_pos exist, run a recursive match for each pair of them between query and subject.
        # If GENE info is in query but not subject, added it.
        if "GENE" in query.columns and "GENE" not  in subject.columns:
            subject = subject.merge(query[["GENE","CHR","POS"]]).drop_duplicates().sort_values("GENE")
            ## It is crucial that the index was built via this function, where the order of A0/A1 was removed. Otherwise will cause error is issue #306
            subject.index = namebyordA0_A1(subject[["GENE","CHR","POS","A0","A1"]],cols=["GENE","CHR","POS","A0","A1"])

        nq,_ = snps_match(query,subject,keep_ambiguous)
        nq = nq.loc[:,~nq.columns.duplicated()] # Remove duplicated columns due to order of columns difference in subject and query
        nqs.append(nq)
    if intersect:
        #get common snps
        common_snps = set.intersection(*[set(nq.SNP) for nq in nqs])
        print('Total number of common SNPs: ',len(common_snps))
        #write out new sumstats
        for output_sumstats,nq in zip(lst_sumstats_file,nqs):
            # Copy the subset so that adding a column does not write into a view of nq
            sumstats = nq[nq.SNP.isin(common_snps)].copy()
            sumstats["variants"] = sumstats.CHR.astype(str) + "_" + sumstats.POS.astype(str) + "_" + sumstats.A0 + "_" + sumstats.A1
            sumstats = sumstats.rename(columns = name_map )
            sumstats.to_csv(os.path.join(output_path, output_sumstats), sep = "\t", header = True, index = False)
    else:
        for output_sumstats,nq in zip(lst_sumstats_file,nqs):
            nq["variants"] = nq.CHR.astype(str) + "_" + nq.POS.astype(str) + "_" + nq.A0 + "_" + nq.A1
            nq = nq.rename(columns = name_map )
            #output match SNPs with target SNPs.
            nq.to_csv(os.path.join(output_path, output_sumstats), sep = "\t", header = True, index = False)
    print('All are done')
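
The comment in the step above stresses that the subject index must not depend on the order of `A0` and `A1`. One way to obtain such an order-free key is sketched here. It is a hypothetical illustration and not the `namebyordA0_A1` implementation from `cugg`; the `:` separator is an assumption, and strand differences are not handled.

```python
def allele_order_free_key(chrom, pos, a0, a1, gene=None):
    """Build a key that is identical for A0/A1 and A1/A0 at the same locus."""
    alleles = sorted([a0, a1])
    prefix = [gene] if gene else []
    return ":".join(prefix + [str(chrom), str(pos)] + alleles)


print(allele_order_free_key("1", 100, "A", "G"))
print(allele_order_free_key("1", 100, "G", "A"))
print(allele_order_free_key("1", 100, "G", "A", gene="geneA"))
```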

In [None]:
[sumstat_to_vcf_1]
input:  for_each = "sumstat_path"
output: [f'{path(x):an}.vcf' for x in _sumstat_path]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: expand = '${ }', stderr = f'{cwd:a}/{path(_sumstat_path[0]):bn}.stderr', stdout = f'{cwd:a}/output.stdout'
    from cugg.sumstat import ss_2_vcf
    import pandas as pd
    from sos.targets import path
    sumstat_path_list = ${_sumstat_path}
    name = ${name}
    ## Remap the YML field name
    new_key = ["molecular_trait_id","chromosome", "position", "ref", "alt","beta","se","pvalue","variant"]
    old_key = ["GENE" ,  "CHR" ,  "POS" ,   "A0" ,   "A1" ,"STAT","SE","P","SNP"]
    name_map = {old_key[i]: new_key[i] for i in range(len(old_key))}
    name_map_rev = {new_key[i]: old_key[i] for i in range(len(old_key))}
    for x,y in zip(sumstat_path_list,name):
        sumstats = pd.read_csv(x, sep = "\t").rename(columns = name_map_rev )
        sumstats,header = ss_2_vcf(sumstats,y)
        with open(f'{path(x):an}.vcf', 'w') as f:
            f.write(header)
        sumstats.to_csv(f'{path(x):an}.vcf', sep = "\t", header = True, index = False,mode = "a")

In [None]:
[sumstat_to_vcf_2]
output: f'{cwd}/{_input[0]:bn}.merged.vcf.gz'.replace(name[0],"_".join(name))
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand = '${ }', stderr = f'{cwd:a}/{_output:bn}.stderr', stdout = f'{cwd:a}/{_output:bn}.stdout',container = container
    for i in ${_input:r}; do
    bgzip -k -f $i 
    tabix -p vcf -f  $i.gz; done
    bcftools merge ${" ".join([f'{str(x)}.gz' for x in _input])} --force-samples -m id  -Oz -o ${_output:a}