Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qc html #195

Open
jasonwjohns opened this issue May 16, 2024 · 4 comments
Open

qc html #195

jasonwjohns opened this issue May 16, 2024 · 4 comments

Comments

@jasonwjohns
Copy link

Hey guys,

It's your favorite noob here. I took the filtered.vcf file from the snpArcher output, pruned it for LD using plink, and now I'm trying to feed it back through the qc module to re-produce the nice QC figures you guys coded. If all else fails I can produce my own figures, but your workflow is much cleaner than anything I would write.

I changed the Snakefile a bit, as you can see below. Namely, I deleted the subsample_snps rule and adjusted the plink rule to call the right file.

The qc module runs well until it gets to producing the maps, which is the output I'm most interested in at this point. My workingdir has the config and results directories in it, and the results directory has both the data and summary_stats directories. My command line and output with the error message is below. If I comment out lines 531-573 on the qc_dashboard_interactive.Rmd file the module runs to completion, but neither map is produced.

Thank you in advance for any help you can give!
Jason

`
(snparcher) johns@ qc2 % snakemake -s /Users/johns/snpArcher/workflow/modules/qc2/Snakefile -d . --profile /Users/johns/snpArcher/profiles/default --cores 22
Using profile /Users/johns/snpArcher/profiles/default for setting default command line arguments.
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /opt/homebrew/bin/bash
Provided cores: 22
Rules claiming more threads will be scaled down.
Job stats:
job count


all 1
qc_plots 1
total 2

Select jobs to execute...
Execute 1 jobs...

[Thu May 16 12:06:54 2024]
localrule qc_plots:
input: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.eigenvec, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.eigenval, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.idepth, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.dist, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.dist.id, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.king, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.imiss, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.3.Q, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.2.Q, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia_fai_tmp.txt, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.bed, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.bim, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.fam, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia_bam_sumstats.txt, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.FILTER.summary, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.het, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.fna.fai, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.coords.txt
output: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia_qc.html
jobid: 1
reason: Updated input files: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.coords.txt
wildcards: refGenome=20240328.dmAquExim1.NCBI.hap2, prefix=32-Aquilegia
resources: mem_mb=2000, mem_mib=1908, disk_mb=1000, disk_mib=954, tmpdir=/var/folders/xn/7zzgchrs3n749xgsx_qxz6vm0000gt/T, mem_mb_reduced=1800

Activating conda environment: .snakemake/conda/e70e9ce0dad747af639b891b79732a30_

processing file: qc_dashboard_interactive.Rmd
|............................................... | 92% [unnamed-chunk-15]
Quitting from lines 531-575 [unnamed-chunk-15] (qc_dashboard_interactive.Rmd)
Error in is.finite():
! default method not implemented for type 'list'
Backtrace:

  1. plotly::ggplotly(...)
  2. plotly:::ggplotly.ggplot(...)
  3. plotly::gg2list(...)
  4. plotly (local) ggplotly_build(p)
  5. base::lapply(data, ggfun("scales_transform_df"), scales = scales)
    ...
  6. ggplot2 (local) f(..., self = self)
  7. base::lapply(df[aesthetics], self$transform)
  8. ggplot2 (local) FUN(X[[i]], ...)
  9. ggplot2 (local) f(..., self = self)
  10. ggplot2:::check_transformation(x, new_x, self$scale_name, axis)

Execution halted
RuleException:
CalledProcessError in file /Users/johns/snpArcher/workflow/modules/qc2/Snakefile, line 172:
Command 'source /Applications/miniforge3/bin/activate '/Users/johns/temp/ccgp_popgen/qc2/.snakemake/conda/e70e9ce0dad747af639b891b79732a30_'; set -euo pipefail; Rscript --vanilla /Users/johns/temp/ccgp_popgen/qc2/.snakemake/scripts/tmppnkqgn4z.qc_dashboard_render.R' returned non-zero exit status 1.
[Thu May 16 12:07:20 2024]
Error in rule qc_plots:
jobid: 1
input: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.eigenvec, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.eigenval, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.idepth, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.dist, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.dist.id, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.king, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.imiss, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.3.Q, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.2.Q, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia_fai_tmp.txt, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.bed, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.bim, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.fam, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia_bam_sumstats.txt, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.FILTER.summary, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.het, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.fna.fai, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.coords.txt
output: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia_qc.html
conda-env: /Users/johns/temp/ccgp_popgen/qc2/.snakemake/conda/e70e9ce0dad747af639b891b79732a30_

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-16T120653.069723.snakemake.log
WorkflowError:
At least one job did not complete successfully.

`

Altered Snakefile:
`
configfile: "config/config.yaml"
include: "common.smk"

samples = snparcher_utils.parse_sample_sheet(config)
REFGENOME = samples['refGenome'].unique().tolist()

rule all:
input:
expand("results/{refGenome}/QC/{prefix}_qc.html", refGenome=REFGENOME, prefix=config['final_prefix'])

rule check_fai:
"""
checks fai file for numeric first column, then do not run plink and rest of workflow if they are all numeric
"""
input:
vcf = "results/{refGenome}/{prefix}_raw.vcf.gz",
fai = "results/{refGenome}/data/genome/{refGenome}.fna.fai",
output:
faiResult = "results/{refGenome}/QC/{prefix}_fai_tmp.txt"
resources:
mem_mb = lambda wildcards, attempt: attempt * 1000
run:
check_contig_names(input.fai, output.faiResult)

rule vcftools_individuals:
input:
vcf = "results/{refGenome}/{prefix}_raw.vcf.gz"
output:
depth = "results/{refGenome}/QC/{prefix}.idepth",
miss = "results/{refGenome}/QC/{prefix}.imiss",
samps = "results/{refGenome}/QC/{prefix}.samps.txt",
summ = "results/{refGenome}/QC/{prefix}.FILTER.summary",
het = "results/{refGenome}/QC/{prefix}.het"
conda:
"envs/vcftools_individuals.yml"
params:
prefix = lambda wc, input: os.path.join(input.vcf.rsplit("/", 1)[0], "QC", wc.prefix),
min_depth = config["min_depth"]
log:
"logs/{refGenome}/QC/vcftools_individuals/{prefix}.txt"
resources:
mem_mb = lambda wildcards, attempt: attempt * 4000
shell:
"""
vcftools --gzvcf {input.vcf} --FILTER-summary --out {params.prefix} &> {log}
vcftools --gzvcf {input.vcf} --out {params.prefix} --depth &>> {log}
vcftools --gzvcf {input.vcf} --out {params.prefix} --het &>> {log}
vcftools --gzvcf {input.vcf} --out {params.prefix} --missing-indv &>> {log}
tail -n +2 {output.depth} | awk '$3>{params.min_depth} {{print $1}}'> {output.samps} 2>> {log}
"""

rule plink:
"""
Call plink PCA.
"""
input:
vcf = "results/{refGenome}/QC/{prefix}_filtered.vcf.gz",
faiResult = "results/{refGenome}/QC/{prefix}_fai_tmp.txt"
params:
prefix = lambda wc, input: input.vcf.replace("_filtered.vcf.gz", "")
output:
bed = "results/{refGenome}/QC/{prefix}.bed",
bim = "results/{refGenome}/QC/{prefix}.bim",
fam = "results/{refGenome}/QC/{prefix}.fam",
eigenvec = "results/{refGenome}/QC/{prefix}.eigenvec",
eigenval = "results/{refGenome}/QC/{prefix}.eigenval",
dist = "results/{refGenome}/QC/{prefix}.dist",
distid = "results/{refGenome}/QC/{prefix}.dist.id",
king = "results/{refGenome}/QC/{prefix}.king"
conda:
"envs/plink.yml"
log:
"logs/{refGenome}/QC/plink/{prefix}.txt"
resources:
mem_mb = lambda wildcards, attempt: attempt * 2000
shell:
#plink 2 for king relatedness matrix (robust to structure) and plink 1.9 for distance matrix
"""
plink2 --vcf {input.vcf} --pca 10 --out {params.prefix} --allow-extra-chr --autosome-num 95 --make-bed --make-king square --const-fid --bad-freqs &> {log}
plink --vcf {input.vcf} --out {params.prefix} --allow-extra-chr --autosome-num 95 --distance square --const-fid &>> {log}
"""

rule setup_admixture:
"""
admixture requires all chromosome names to be integers, this sets them to be 1:n
"""
input:
bim = "results/{refGenome}/QC/{prefix}.bim",
fai = "results/{refGenome}/data/genome/{refGenome}.fna.fai",
output:
bim = "results/{refGenome}/QC/{prefix}.bim_fixed",
bim_back = "results/{refGenome}/QC/{prefix}.bim.orig"
script:
"scripts/contigs4admixture.py"

rule admixture:
"""
Call Admixture. First, make a bim file that has no charecters in the chromosomes
"""
input:
bed = "results/{refGenome}/QC/{prefix}.bed",
bim = "results/{refGenome}/QC/{prefix}.bim",
fam = "results/{refGenome}/QC/{prefix}.fam",
bim_fixed = "results/{refGenome}/QC/{prefix}.bim_fixed",
bim_back = "results/{refGenome}/QC/{prefix}.bim.orig"
output:
admix = "results/{refGenome}/QC/{prefix}.3.Q",
admix2 = "results/{refGenome}/QC/{prefix}.2.Q"
params:
outdir = lambda wc, input: input.bed.rsplit("/", 1)[0]
log:
"logs/{refGenome}/QC/admixture/{prefix}.txt"
resources:
mem_mb = lambda wildcards, attempt: attempt * 4000
conda:
"envs/admixture.yml"
shell:
"""
mv {input.bim_fixed} {input.bim} 2> {log}

    admixture {input.bed} 2 &>> {log}
    admixture {input.bed} 3 &>> {log}

    mv "{wildcards.prefix}".2.* {params.outdir} &>> {log}
    mv "{wildcards.prefix}".3.* {params.outdir} &>> {log}
    """

rule generate_coords_file:
output:
"results/{refGenome}/QC/{prefix}.coords.txt"
run:
out_df = samples.loc[(samples['refGenome'] == wildcards.refGenome)][["BioSample", "long", "lat"]]
out_df.drop_duplicates("BioSample", inplace=True)
out_df.dropna(subset=["long", "lat"], thresh=1, inplace=True)
out_df.to_csv(output[0], index=False, sep="\t", header=False)

rule qc_plots:
"""
Call plotting script
"""
input:
eigenvec = "results/{refGenome}/QC/{prefix}.eigenvec",
eigenval = "results/{refGenome}/QC/{prefix}.eigenval",
depth = "results/{refGenome}/QC/{prefix}.idepth",
dist = "results/{refGenome}/QC/{prefix}.dist",
distid = "results/{refGenome}/QC/{prefix}.dist.id",
king = "results/{refGenome}/QC/{prefix}.king",
miss = "results/{refGenome}/QC/{prefix}.imiss",
admix3 = "results/{refGenome}/QC/{prefix}.3.Q",
admix2 = "results/{refGenome}/QC/{prefix}.2.Q",
faiResult = "results/{refGenome}/QC/{prefix}_fai_tmp.txt",
bed = "results/{refGenome}/QC/{prefix}.bed",
bim = "results/{refGenome}/QC/{prefix}.bim",
fam = "results/{refGenome}/QC/{prefix}.fam",
sumstats = "results/{refGenome}/QC/{prefix}_bam_sumstats.txt",
summ = "results/{refGenome}/QC/{prefix}.FILTER.summary",
het = "results/{refGenome}/QC/{prefix}.het",
fai = "results/{refGenome}/QC/{prefix}.fna.fai",
coords = get_coords_if_available
params:
prefix = lambda wc, input: input.het[:-4],
nClusters = config['nClusters'],
GMKey = config['GoogleAPIKey']
resources:
mem_mb = lambda wildcards, attempt: attempt * 2000
output:
qcpdf = "results/{refGenome}/QC/{prefix}_qc.html"
conda:
"envs/qc.yml"
script:
"scripts/qc_dashboard_render.R"
`

@cademirch
Copy link
Collaborator

Hi Jason,

Happy to help - for the future, could you please wrap your code/errors in triple back ticks to make a code block? It makes it much easier to read :)

As for the error, I'm not super sure. It may be easiest if you can supply the pruned VCF and other files so I can try to run the workflow and diagnose. Lmk if that would work for you.

@jasonwjohns
Copy link
Author

jasonwjohns commented May 17, 2024 via email

@cademirch
Copy link
Collaborator

Thanks Jason. I'll try to take a look today if I get a chance!

@jasonwjohns
Copy link
Author

jasonwjohns commented May 17, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants