add assembly #146

Merged
pdimens merged 70 commits into main from cloudspades on Oct 22, 2024

@pdimens pdimens commented Oct 20, 2024

In addition to fixing some bugs, standardizing formatting, and simplifying the logic that creates workflow summaries, this PR adds:

  • harpy metassembly is now harpy assembly with an option for metassembly via a flag
  • harpy assembly can perform single-sample assembly
  • fixes Add assembler via cloudspades #145
  • impute params have extra column
  • impute columns not fixed order
  • rm dep on pandas (no longer uses paramspace)

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new assembly workflow with enhanced configuration management.
    • Added support for multiple assembly methods, including CloudSpades and SPAdes.
    • Enhanced logging and summary reporting for workflow execution.
  • Bug Fixes

    • Improved error handling and validation for input parameters across various commands.
  • Documentation

    • Updated command-line help texts for clarity and conciseness.
  • Refactor

    • Transitioned to a list-based approach for summary generation in workflows, improving readability and maintainability.
  • Chores

    • Removed deprecated files and dependencies to streamline the project structure.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (11)
harpy/bin/bx_stats.py (1)

Line range hint 1-180: Overall improvement with potential for further optimization

The change to delete processed keys from the dictionary is a good step towards better memory management. However, there might be room for further optimizations:

  1. Consider using a generator or itertools for processing large files to reduce memory usage further.
  2. The invalidBX handling could potentially be optimized to reduce dictionary lookups.
  3. The script could benefit from type hinting for better code readability and maintenance.

These suggestions are not critical but could be considered for future improvements.

Would you like me to provide code examples for any of these potential optimizations?
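
As a rough illustration of the first suggestion, here is a minimal sketch of generator-based processing. It assumes a tab-separated text input with the barcode in the first column and a position in the second; that layout and the names `iter_barcode_records` and `count_per_barcode` are hypothetical and are not taken from bx_stats.py.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def iter_barcode_records(handle: TextIO) -> Iterator[Tuple[str, int]]:
    """Yield (barcode, position) pairs one line at a time.

    Hypothetical input layout: barcode<TAB>position, '#' lines are comments.
    """
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        barcode, position = line.split("\t")[:2]
        yield barcode, int(position)


def count_per_barcode(handle: TextIO) -> Dict[str, int]:
    """Count records per barcode without holding the whole file in memory."""
    counts: Dict[str, int] = {}
    for barcode, _position in iter_barcode_records(handle):
        counts[barcode] = counts.get(barcode, 0) + 1
    return counts


# Usage with an in-memory file standing in for real input
example = io.StringIO("# comment\nA01\t10\nA01\t20\nB02\t5\n")
counts = count_per_barcode(example)
```

Because the generator yields one record at a time, only the running counts stay in memory, not the parsed lines.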

harpy/snakefiles/metassembly.smk (7)

16-22: LGTM! Consider using a dictionary for assembler-specific configurations

The addition of these configuration variables improves the flexibility of the workflow. The cloudspades boolean is a good way to handle conditional execution.

To further improve maintainability, consider using a dictionary for assembler-specific configurations:

assembler_configs = {
    "cloudspades": {"dir": f"{outdir}/cloudspades_assembly"},
    "spades": {"dir": f"{outdir}/spades_assembly"}
}
spadesdir = assembler_configs[metassembly]["dir"]

This approach would make it easier to add more assemblers in the future.


33-36: Good addition of threads, consider using the containerized variable

The addition of the threads parameter is a good improvement for performance. However, the container parameter is set to None, which seems inconsistent with the containerized variable defined at the beginning of the file.

Consider using the containerized variable for the container parameter:

- container:
-     None
+ container:
+     containerized

This change would ensure consistency with the containerization strategy defined earlier in the file.


75-78: Good addition of conda environment, consider using the containerized variable

The addition of a specific conda environment for Spades is excellent for reproducibility. However, the container parameter is set to None, which seems inconsistent with the containerized variable defined at the beginning of the file.

Consider using the containerized variable for the container parameter:

- container:
-     None
+ container:
+     containerized

This change would ensure consistency with the containerization strategy defined earlier in the file.


84-105: Excellent updates to the spades_assembly rule, consider using the containerized variable

The changes to the spades_assembly rule are well-thought-out and improve both the configurability and resource management of the assembly step. The use of corrected fastq files is a good practice for improving assembly quality.

However, the container parameter is set to None, which seems inconsistent with the containerized variable defined at the beginning of the file.

Consider using the containerized variable for the container parameter:

- container:
-     None
+ container:
+     containerized

This change would ensure consistency with the containerization strategy defined earlier in the file.


107-128: Excellent addition of CloudSpades assembly support

The new cloudspades_assembly rule is a great addition that provides flexibility in the assembly process. The use of the spadesdir variable ensures consistency with the chosen assembler, and the inclusion of both contigs and scaffolds in the output is comprehensive.

A minor suggestion for improvement:

Consider adding a comment explaining the use of --gemcode1-1 and --gemcode1-2 options, as these are specific to linked-read data:

# Use --gemcode1-1 and --gemcode1-2 options for linked-read data
shell:
    "spades.py --meta -t {threads} -m {params.mem} -k {params.k} {params.extra} --gemcode1-1 {input.fastq_R1} --gemcode1-2 {input.fastq_R2} -o {params.outdir} > {log}"

This comment would help future maintainers understand the purpose of these options.


239-247: Good addition of detailed workflow summary

The updates to the workflow_summary rule provide more comprehensive information about the workflow steps, which is excellent for documentation and reproducibility. The use of f-strings improves readability.

A minor suggestion for improvement:

Consider initializing the summary as a list literal to make the code more concise:

summary = [
    "The harpy metassembly workflow ran using these parameters:",
    f"FASTQ inputs were sorted by their linked-read barcodes:\n\tsamtools import -T \"*\" FQ1 FQ2 |\n\tsamtools sort -O SAM -t {params.bx} |\n\tsamtools fastq -T \"*\" -1 FQ_out1 -2 FQ_out2",
    f"Barcoded-sorted FASTQ files had \"-1\" appended to the barcode to make them Athena-compliant:\n\tsed 's/{params.bx}:Z:[^[:space:]]*/&-1/g' FASTQ | bgzip > FASTQ_OUT"
]

This approach reduces the number of append calls and makes the code more Pythonic.


254-267: Excellent comprehensive workflow summary

The final updates to the workflow_summary rule provide a thorough overview of the entire workflow, including alignment, interleaving, and Athena execution steps. The inclusion of the Snakemake workflow call is particularly valuable for reproducibility.

A minor suggestion for improvement:

Consider using a dictionary to store the summary steps, which can make the code more maintainable and easier to update in the future:

summary_steps = {
    "align": "Original input FASTQ files were aligned to the metagenome using BWA:\n\tbwa mem -C -p spades.contigs FQ1 FQ2 | samtools sort -O bam -",
    "interleaved": "Barcode-sorted Athena-compliant sequences were interleaved with seqtk:\n\tseqtk mergepe FQ1 FQ2 > INTERLEAVED.FQ",
    "athena": "Athena ran with the config file Harpy built from the files created from the previous steps:\n\tathena-meta --config athena.config",
    "snakemake": f"The Snakemake workflow was called via command line:\n\t{config['workflow_call']}"
}

summary.extend(summary_steps.values())

This approach makes it easier to add, remove, or modify steps in the future without changing the structure of the code.

harpy/snakefiles/impute.smk (3)

80-98: Enhanced flexibility and organization in impute rule

The updates to the impute rule significantly improve its flexibility and organization:

  1. The use of {paramset} and {contig} in paths allows for multiple parameter configurations.
  2. Temporary output directories for intermediate files improve cleanup.
  3. Parameters now reference stitch_params[wc.paramset], allowing for easier management of different STITCH configurations.

These changes greatly enhance the rule's functionality and maintainability.

Consider adding a comment explaining the structure of stitch_params for easier understanding and maintenance.
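
One way such a comment could be backed up is with a small, documented loader. The sketch below assumes the parameter file is tab-separated with a `name` column identifying each parameter set; the column names (`name`, `model`, `k`, `s`) and the function `read_stitch_params` are illustrative assumptions, not the actual implementation in impute.smk.

```python
import csv
import io


def read_stitch_params(handle):
    """Return {paramset_name: {column: value}} from a tab-separated file.

    Assumed layout: a header row with a 'name' column plus one column per
    STITCH parameter. Columns are looked up by header, so their order in
    the file does not matter. All values are kept as strings.
    """
    params = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        name = row.pop("name")  # assumed identifier column
        params[name] = row
    return params


# Usage with an in-memory file standing in for the real parameter file
example = io.StringIO(
    "name\tmodel\tk\ts\n"
    "set1\tdiploid\t10\t5\n"
    "set2\tpseudoHaploid\t15\t1\n"
)
stitch_params = read_stitch_params(example)

# Inside a rule, a lookup like stitch_params[wc.paramset] would then
# resolve to the dict for the matching parameter set:
k_for_set1 = stitch_params["set1"]["k"]
```

Only the standard library is used here, which fits the PR's removal of the pandas dependency.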


122-137: New contig_report rule enhances granular analysis

The addition of the contig_report rule is a valuable enhancement:

  1. It allows for contig-specific reporting, providing more detailed analysis.
  2. The use of stitch_params[wc.paramset] maintains consistency with other rules.

This new rule will greatly aid in understanding imputation results at a finer level.

Consider adding a brief comment explaining the purpose and output of this rule for better documentation.


205-259: Improved reporting and summary generation

The updates to impute_reports and workflow_summary rules significantly enhance the workflow's output:

  1. The impute_reports rule now uses {paramset}, maintaining consistency.
  2. The workflow_summary rule has been refactored to use a list-based approach for summary generation, improving readability and maintainability.
  3. The summary now includes more detailed information about the workflow execution.

These changes will greatly improve the usability and interpretability of the workflow results.

Consider adding comments to explain the structure of the summary list for easier future modifications.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between de695a7 and abde854.

📒 Files selected for processing (3)
  • harpy/bin/bx_stats.py (1 hunks)
  • harpy/snakefiles/impute.smk (6 hunks)
  • harpy/snakefiles/metassembly.smk (10 hunks)
🧰 Additional context used
🔇 Additional comments (15)
harpy/bin/bx_stats.py (1)

54-55: Excellent memory management improvement!

The addition of the line to delete the key after writing its statistics is a great improvement for RAM management. This change aligns well with the existing comment about RAM usage considerations and will help prevent memory issues when processing large datasets.
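
The pattern can be sketched as follows. This is a simplified illustration, not the code in bx_stats.py: the `stats` dictionary layout and the function name `flush_stats` are assumptions.

```python
import io


def flush_stats(stats, out):
    """Write each barcode's statistics, then drop the entry to free memory early."""
    # Iterate over a copy of the keys: deleting from a dict while iterating
    # over it directly raises a RuntimeError.
    for barcode in list(stats):
        out.write(f"{barcode}\t{stats[barcode]}\n")
        del stats[barcode]


# Usage with an in-memory buffer standing in for the output file
stats = {"A01": 12, "B02": 7}
buffer = io.StringIO()
flush_stats(stats, buffer)
```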

harpy/snakefiles/metassembly.smk (7)

132-134: LGTM! Good use of spadesdir variable

The updates to the bwa_index rule are well-implemented. Using the spadesdir variable ensures that the correct assembly output is used for indexing, regardless of the chosen assembler. The use of multiext for defining multiple output files is a good Snakemake practice.


144-148: LGTM! Good use of spadesdir variable and collect function

The updates to the bwa_align rule are well-implemented. Using the spadesdir variable ensures consistency with the chosen assembler. The use of the collect function for input fastq files is a good Snakemake practice for handling multiple input files efficiently.


183-186: LGTM! Good use of spadesdir variable

The updates to the input paths in the athena_config rule are well-implemented. Using the spadesdir variable ensures consistency with the chosen assembler, maintaining coherence throughout the workflow.


209-211: LGTM! Good use of spadesdir variable and multiext function

The updates to the input paths in the athena rule are well-implemented. Using the spadesdir variable ensures consistency with the chosen assembler. The use of the multiext function for input files is a good Snakemake practice for handling multiple related input files efficiently.


248-253: Excellent addition of assembler-specific information to the workflow summary

The updates to include assembler-specific information in the workflow summary are well-implemented. The use of conditional logic to provide the correct command based on the chosen assembler (CloudSpades or metaSPAdes) ensures that the summary is accurate and complete.

This improvement significantly enhances the documentation of the workflow, making it easier for users to understand and reproduce the assembly process.


176-177: 🛠️ Refactor suggestion

Consider using the containerized variable for the container parameter

In the interleave_fastq rule, the container parameter is set to None, which seems inconsistent with the containerized variable defined at the beginning of the file.

Consider using the containerized variable for the container parameter:

- container:
-     None
+ container:
+     containerized

This change would ensure consistency with the containerization strategy defined earlier in the file.

Likely invalid or redundant comment.


51-52: 🛠️ Refactor suggestion

Consider using the containerized variable for the container parameter

Similar to the previous rule, setting the container parameter to None seems inconsistent with the containerized variable defined at the beginning of the file.

Consider using the containerized variable for the container parameter:

- container:
-     None
+ container:
+     containerized

This change would ensure consistency with the containerization strategy defined earlier in the file.

Likely invalid or redundant comment.

harpy/snakefiles/impute.smk (7)

13-14: Improved workflow robustness with wildcard constraints

The addition of wildcard constraints for paramset and contig enhances the safety and predictability of the workflow. These constraints prevent potential issues like directory traversal and unexpected pattern matches.
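
To illustrate the effect, the sketch below emulates constrained wildcards with plain regular expressions. Snakemake's default wildcard pattern is `.+`, which can match across slashes; the `[^/]+` constraints and the path layout used here are assumptions for illustration and are not copied from impute.smk.

```python
import re

# Assumed constraints: neither wildcard may contain a path separator
CONSTRAINTS = {"paramset": r"[^/]+", "contig": r"[^/]+"}


def parse_vcf_path(path):
    """Return the wildcard values for a hypothetical output path, or None."""
    pattern = re.compile(
        rf"(?P<paramset>{CONSTRAINTS['paramset']})/contigs/"
        rf"(?P<contig>{CONSTRAINTS['contig']})\.vcf\.gz"
    )
    match = pattern.fullmatch(path)
    return match.groupdict() if match else None


ok = parse_vcf_path("model1/contigs/chr1.vcf.gz")
rejected = parse_vcf_path("../secret/contigs/chr1.vcf.gz")
```

With an unconstrained `.+` pattern, the second path would match with `paramset` set to `../secret`; the constraint rejects it.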


16-26: Improved variable management and file handling

These changes enhance the code's robustness and maintainability:

  1. The use of os.path.join for envdir improves cross-platform compatibility.
  2. The introduction of stitch_params suggests a more structured approach to managing STITCH parameters.
  3. The use of a with statement for file reading ensures proper file handling.

These improvements address previous review comments and align with best practices.
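
For reference, the two idioms look roughly like this. The directory name `.harpy_envs`, the file name `contigs.txt`, and the helper `read_lines` are assumptions made for the example, not values confirmed by the diff.

```python
import os
import tempfile


def read_lines(path):
    """Read non-empty lines; the with block closes the file even on error."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# os.path.join picks the correct separator for the current platform
envdir = os.path.join(os.getcwd(), ".harpy_envs")

# Usage with a temporary file standing in for a real input list
with tempfile.TemporaryDirectory() as tmp:
    listing = os.path.join(tmp, "contigs.txt")
    with open(listing, "w", encoding="utf-8") as handle:
        handle.write("chr1\n\nchr2\n")
    contigs = read_lines(listing)
```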


54-54: Enhanced dependency management in alignment_list rule

The addition of bailist as an input ensures that BAM index files are created before the alignment list is generated. This improves the workflow's dependency management and prevents potential race conditions.


66-75: Improved consistency in stitch_conversion rule

The changes to the stitch_conversion rule align it with the new wildcard structure:

  1. The output file naming now uses {contig} instead of {part}.
  2. The shell command correctly uses {wildcards.contig} to filter regions.

These changes improve the consistency and clarity of the workflow.


108-112: Improved index_vcf rule with paramset support and additional output

The updates to the index_vcf rule enhance its functionality:

  1. Input and output paths now use {paramset}, aligning with the new structure.
  2. A new output for contig stats has been added, which is useful for downstream analysis.

These changes improve the rule's integration with the overall workflow and provide additional valuable data.


145-157: Improved concat_list and merge_vcf rules with paramset support

The updates to the concat_list and merge_vcf rules enhance their integration with the workflow:

  1. Both rules now use {paramset} in their paths, aligning with the new structure.
  2. The merge_vcf rule now outputs a BCF file, which is typically more efficient for large datasets.

These changes improve the consistency of the workflow and potentially its performance.


167-194: Consistent updates and enhanced statistics in multiple rules

The changes to index_merged, general_stats, and compare_stats rules improve the workflow:

  1. All rules now consistently use {paramset} in their paths, aligning with the new structure.
  2. The compare_stats rule includes an additional output for info scores, which is valuable for assessing imputation quality.

These updates maintain workflow consistency and provide more comprehensive statistics for analysis.

Comment thread harpy/snakefiles/metassembly.smk
Repository owner deleted a comment from coderabbitai Bot Oct 22, 2024

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (15)
harpy/snakefiles/qc.smk (4)

Line range hint 61-80: LGTM: Fastp rule adapted for deconvolution.

The fastp rule has been effectively modified to support deconvolution by outputting an interleaved fastq file. This aligns well with the PR objectives. The use of --stdout and file redirection is an efficient approach.

Consider adding a comment explaining why the output is temporary and interleaved, to improve code readability.


Line range hint 82-98: LGTM: New deconvolve rule added.

The new deconvolve rule successfully implements the deconvolution feature using QuickDeconvolution, as outlined in the PR objectives. The use of configuration parameters and temporary output is well-handled.

Consider adding a brief comment explaining the purpose and function of QuickDeconvolution for better code documentation.


Line range hint 123-141: LGTM: New barcode analysis rules added.

The new rules for barcode checking and reporting add valuable functionality to the workflow. The separation of counting and reporting into distinct rules is a good design choice.

Consider adding a brief comment explaining the purpose of the barcode analysis and how it fits into the overall workflow.


184-200: LGTM: Improved workflow summary generation.

The restructured workflow_summary rule now provides a more comprehensive overview of the entire process, including details about deconvolution and interleaved file handling. The inclusion of the Snakemake workflow call command enhances reproducibility.

Consider using a list comprehension or join() method for constructing the summary string to make the code more concise and Pythonic.
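
A minimal sketch of the join() approach, with an optional step included only when enabled. The step descriptions and commands are placeholders, and `build_summary` is a hypothetical helper, not code from qc.smk.

```python
def build_summary(deconvolve):
    """Assemble summary sections in a list and join them once at the end."""
    summary = ["The workflow ran using these parameters:"]
    # Placeholder text: the real rule would describe the actual commands
    summary.append("Reads were trimmed:\n\tplaceholder-trim-command INPUT")
    if deconvolve:
        summary.append("Reads were deconvolved:\n\tplaceholder-deconvolve-command INPUT")
    summary.append("The Snakemake workflow was called via command line:\n\tplaceholder-workflow-call")
    # A blank line between sections keeps the written file readable
    return "\n\n".join(summary)


without_deconvolution = build_summary(deconvolve=False)
with_deconvolution = build_summary(deconvolve=True)
```

Conditional sections become a single `if` around an `append`, and the separator is defined in one place.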

harpy/snakefiles/simulate_snpindel.smk (3)

46-48: Consistent configuration structure with a minor suggestion

The changes to snp_constraint and titv_ratio configuration are consistent with the new structure, improving organization. However, consider using a more specific variable name instead of ratio to enhance clarity:

titv_ratio = config["snp"].get("titv_ratio", None)
variant_params += f" -titv_ratio {titv_ratio}" if titv_ratio else ""

This change would make the code more self-documenting and reduce potential confusion with other ratio variables.


53-58: Enhanced indel simulation control with consistent configuration

These changes improve the indel simulation capabilities:

  1. The configuration structure for indel_ratio is now consistent with other parameters.
  2. The addition of size_alpha and size_constant provides more detailed control over indel size simulation.

These enhancements align well with the PR objectives of improving functionality. However, consider using a more specific variable name instead of ratio to enhance clarity:

indel_ratio = config["indel"].get("indel_ratio", None)
variant_params += f" -ins_del_ratio {indel_ratio}" if indel_ratio else ""

This change would make the code more self-documenting and reduce potential confusion with other ratio variables.


175-189: Comprehensive workflow summary generation

The new run block in the workflow_summary rule is an excellent addition:

  1. It generates a detailed summary of the workflow execution, including crucial information like the genome used, heterozygosity, and commands executed.
  2. The use of a list to build the summary improves code readability and maintainability.
  3. Writing the summary to a file is beneficial for documentation and debugging purposes.

These improvements align perfectly with the PR objectives of simplifying the logic for creating workflow summaries.

However, to ensure the summary file is always created successfully, consider adding a check to create the directory if it doesn't exist:

import os

os.makedirs(os.path.dirname(f"{outdir}/workflow/simulate.snpindel.summary"), exist_ok=True)
with open(f"{outdir}/workflow/simulate.snpindel.summary", "w") as f:
    f.write("\n\n".join(summary))

This change will prevent a potential FileNotFoundError if the output directory doesn't exist yet.

harpy/snakefiles/metassembly.smk (8)

1-2: Consider removing or utilizing the containerized variable

The containerized variable is defined but not used within this file. If it's intended for use in other files, consider moving it to a common configuration file. If it's not needed, remove it to avoid confusion.


16-22: LGTM! Consider minor readability improvement

The new configuration variables enhance the workflow's flexibility. The conditional logic for cloudspades and spadesdir is well-implemented.

For improved readability, consider using a ternary operator for spadesdir:

spadesdir = f"{outdir}/{'cloudspades' if cloudspades else 'spades'}_assembly"

This change would make the logic more concise and easier to read at a glance.


33-36: Good addition of threads, consider using containerized for consistency

The addition of the threads parameter is beneficial for parallelization. However, setting container: None seems inconsistent with the containerized variable defined at the beginning of the file.

Consider using the containerized variable for the container parameter:

container:
    containerized

This change would ensure consistency with the containerization strategy defined earlier in the file.


75-78: Good addition of conda, consider using containerized for consistency

The addition of the conda parameter is beneficial for environment management. However, setting container: None is again inconsistent with the containerized variable defined at the beginning of the file.

Consider using the containerized variable for the container parameter:

container:
    containerized

This change would ensure consistency with the containerization strategy defined earlier in the file.


84-105: LGTM! Consider using containerized for consistency

The updates to input and output paths, and the addition of threads and resources parameters are good improvements for resource management and consistency. The conda parameter is also consistent with the previous rule.

However, setting container: None is again inconsistent with the containerized variable defined at the beginning of the file.

Consider using the containerized variable for the container parameter:

container:
    containerized

This change would ensure consistency with the containerization strategy defined earlier in the file.


107-128: LGTM! Consider adding a comment for clarity

The addition of the cloudspades_assembly rule is well-implemented and consistent with the new metassembly option. The use of threads, resources, and conda parameters is good for resource management and environment consistency.

To improve clarity, consider adding a comment explaining why this rule uses the original input FASTQ files (FQ1 and FQ2) instead of the error-corrected files used in the spades_assembly rule. This will help other developers understand the intentional difference in input file usage between the two assembly methods.


161-167: LGTM for path updates. Consider using containerized for consistency

The updates to the input and output file paths in the index_alignment rule are consistent with the previous changes. However, setting container: None is again inconsistent with the containerized variable defined at the beginning of the file.

Consider using the containerized variable for the container parameter:

container:
    containerized

This change would ensure consistency with the containerization strategy defined earlier in the file.


239-265: LGTM! Comprehensive workflow summary

The updates to the workflow_summary rule significantly improve the detail and accuracy of the workflow summary. The new summary provides valuable information about the assembly process, including the choice between CloudSpades and metaSPAdes, which is excellent for transparency and debugging.

For consistency, consider using f-strings for all string formatting. For example, line 242 could be changed to:

bxsort += f"\tsamtools sort -O SAM -t {params.bx} |\n"

This minor change would make the string formatting consistent throughout the summary generation.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between abde854 and d8a3ac9.

📒 Files selected for processing (22)
  • harpy/snakefiles/align_bwa.smk (4 hunks)
  • harpy/snakefiles/align_ema.smk (4 hunks)
  • harpy/snakefiles/align_strobealign.smk (3 hunks)
  • harpy/snakefiles/deconvolve.smk (2 hunks)
  • harpy/snakefiles/demultiplex_gen1.smk (4 hunks)
  • harpy/snakefiles/impute.smk (6 hunks)
  • harpy/snakefiles/metassembly.smk (10 hunks)
  • harpy/snakefiles/phase.smk (2 hunks)
  • harpy/snakefiles/preflight_bam.smk (2 hunks)
  • harpy/snakefiles/preflight_fastq.smk (2 hunks)
  • harpy/snakefiles/qc.smk (2 hunks)
  • harpy/snakefiles/simulate_linkedreads.smk (2 hunks)
  • harpy/snakefiles/simulate_snpindel.smk (3 hunks)
  • harpy/snakefiles/simulate_variants.smk (2 hunks)
  • harpy/snakefiles/snp_freebayes.smk (4 hunks)
  • harpy/snakefiles/snp_mpileup.smk (4 hunks)
  • harpy/snakefiles/sv_leviathan.smk (2 hunks)
  • harpy/snakefiles/sv_leviathan_pop.smk (2 hunks)
  • harpy/snakefiles/sv_naibr.smk (2 hunks)
  • harpy/snakefiles/sv_naibr_phase.smk (2 hunks)
  • harpy/snakefiles/sv_naibr_pop.smk (2 hunks)
  • harpy/snakefiles/sv_naibr_pop_phase.smk (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (14)
  • harpy/snakefiles/align_bwa.smk
  • harpy/snakefiles/deconvolve.smk
  • harpy/snakefiles/demultiplex_gen1.smk
  • harpy/snakefiles/phase.smk
  • harpy/snakefiles/preflight_bam.smk
  • harpy/snakefiles/preflight_fastq.smk
  • harpy/snakefiles/simulate_linkedreads.smk
  • harpy/snakefiles/snp_freebayes.smk
  • harpy/snakefiles/sv_leviathan.smk
  • harpy/snakefiles/sv_leviathan_pop.smk
  • harpy/snakefiles/sv_naibr.smk
  • harpy/snakefiles/sv_naibr_phase.smk
  • harpy/snakefiles/sv_naibr_pop.smk
  • harpy/snakefiles/sv_naibr_pop_phase.smk
🧰 Additional context used
🔇 Additional comments (48)
harpy/snakefiles/simulate_variants.smk (4)

Line range hint 1-139: Overall improvements in flexibility and documentation.

The changes in this file significantly enhance the flexibility and configurability of the variant simulation workflow. Key improvements include:

  1. More dynamic configuration access using the variant variable.
  2. Restructured parameter handling for different variant types.
  3. Addition of a comprehensive workflow summary feature.

These changes align well with the PR objectives of standardizing formatting and simplifying workflow summaries. The new assembly functionality is not directly visible in this file but may be reflected in the overall configuration structure.

Make sure to address the minor issues pointed out in the previous comments, particularly regarding error handling and syntax. Also, ensure that the configuration changes (e.g., "value" to "ratio", "exclude_chr" to "excluded_chromosomes") are consistently applied across the entire project.


33-33: Verify impact of changing "exclude_chr" to "excluded_chromosomes".

The configuration key for excluded chromosomes has been changed from "exclude_chr" to "excluded_chromosomes". Ensure that this change is reflected in all configuration files and that it doesn't break any existing functionality or assumptions in other parts of the code.

#!/bin/bash
# Search for uses of "exclude_chr" and "excluded_chromosomes" across the codebase
rg -n 'exclude_chr|excluded_chromosomes' --type py --type yaml

16-16: Add error handling for variant assignment.

The current implementation assumes that config["workflow"] always contains at least two elements when split. To improve robustness, consider adding error handling:

workflow_parts = config["workflow"].split()
if len(workflow_parts) > 1:
    variant = workflow_parts[1]
else:
    raise ValueError("Invalid workflow configuration: 'workflow' must contain at least two space-separated elements.")

This change will prevent potential IndexErrors and provide a clear error message if the configuration is incorrect.

✅ Verification successful

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify the usage of config["workflow"] across the codebase
rg -n 'config\["workflow"\]' --type py

Length of output: 117


20-20: Verify impact of changing heterozygosity from "value" to "ratio".

The heterozygosity configuration key has been changed from "value" to "ratio". Ensure that this change is consistent across the entire codebase and that it doesn't break any existing functionality or assumptions in other parts of the code.

✅ Verification successful

Heterozygosity configuration key updated to "ratio" verified. No residual references to "value" found.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for uses of config["heterozygosity"] across the codebase
rg -n 'config\["heterozygosity"\]' --type py

Length of output: 2108


Script:

#!/bin/bash
# Corrected search for uses of config["heterozygosity"] across the codebase
rg -n 'config\["heterozygosity"\]' --type py

Length of output: 46


Script:

#!/bin/bash
# Search for various usages of heterozygosity in the codebase

# Search for config["heterozygosity"]
rg -n 'config\["heterozygosity"\]' --type py

# Search for config['heterozygosity']
rg -n 'config\['\''heterozygosity'\''\]' --type py

# Search for config.heterozygosity
rg -n 'config\.heterozygosity' --type py

# Search for standalone heterozygosity assignments or usages
rg -n '\bheterozygosity\b' --type py

Length of output: 5130

harpy/snakefiles/qc.smk (2)

Line range hint 16-26: LGTM: Improved path handling and configuration.

The changes in this section improve path handling by using os.path.join and introduce new configuration variables for deconvolution. These modifications enhance cross-platform compatibility and extend the functionality of the workflow.


Line range hint 1-200: Overall: Excellent implementation of new features.

This PR successfully implements the new assembly and deconvolution features as outlined in the objectives. The changes are well-structured and align with best practices in Snakemake workflow design. The addition of barcode analysis and improved summary generation enhances the overall functionality and usability of the workflow.

Minor suggestions have been made for improving code documentation and readability, but these are not critical issues. The overall quality of the implementation is high, and the changes are approved.

harpy/snakefiles/simulate_snpindel.smk (5)

16-19: Improved configuration structure and path handling

The changes in these lines enhance the code in two ways:

  1. Using os.path.join() for envdir improves cross-platform compatibility.
  2. The new configuration structure for SNP and indel parameters is more organized and easier to maintain.

These improvements align well with the PR objectives of standardizing formatting and simplifying logic.
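
A minimal sketch of the path handling noted above. The directory name `.harpy_envs` and the environment file name are placeholders for illustration, not values taken from the diff.

```python
import os

# Hypothetical environment directory, used only for illustration.
envdir = os.path.join(os.getcwd(), ".harpy_envs")

# os.path.join inserts the separator that is correct for the host OS,
# so the same line works unchanged on Linux, macOS, and Windows.
simulate_env = os.path.join(envdir, "simulations.yaml")  # placeholder file name
print(simulate_env)
```

Compared with string concatenation, this avoids hard-coding `/` into the path.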


40-41: Consistent configuration structure for SNP and indel parameters

The changes to snp_count and indel_count configuration are consistent with the new structure observed earlier. This organization improves the manageability of SNP and indel parameters separately, which aligns with the PR's goal of standardizing formatting and simplifying logic.


64-64: Improved configuration key naming

The change from "exclude_chr" to "excluded_chromosomes" for the configuration key is a good improvement. It states the parameter's purpose more clearly and reads more naturally, which helps code readability and maintainability.


169-174: Comprehensive workflow summary parameters

The addition of new parameters to the workflow_summary rule is an excellent improvement:

  1. It allows for a more detailed and comprehensive summary of the workflow.
  2. The use of f-strings with conditional expressions is concise and readable.
  3. The variable names are clear and descriptive, enhancing code maintainability.

These changes align well with the PR objectives of improving the workflow summary generation.
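
The f-string pattern mentioned above could look roughly like this. The variable names, values, and the `--snp-count` style flags are hypothetical placeholders and do not correspond to any real command line.

```python
# Hypothetical config values; the real keys live in the workflow's config file.
snp_count = 500
indel_count = 0
indel_ratio = 1.0

# A conditional expression around an f-string yields the option text only
# when the corresponding value is set, and an empty string otherwise.
snp_opt = f"--snp-count {snp_count}" if snp_count else ""
indel_opt = f"--indel-count {indel_count} --indel-ratio {indel_ratio}" if indel_count else ""

# Drop the empty pieces so the summary line has no stray spaces.
params = " ".join(opt for opt in (snp_opt, indel_opt) if opt)
print(params)
```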


Line range hint 1-189: Overall assessment: Significant improvements in code structure and functionality

This PR introduces several valuable enhancements to the simulate_snpindel.smk file:

  1. Improved configuration structure for SNP and indel parameters.
  2. Enhanced indel simulation control with new parameters.
  3. More descriptive configuration key names.
  4. Comprehensive workflow summary generation.

These changes align well with the PR objectives of standardizing formatting, simplifying logic, and improving the assembly functionality. The code is now more organized, readable, and maintainable.

Minor suggestions have been made for further improvements:

  • Using more specific variable names for ratio parameters.
  • Ensuring the summary file directory exists before writing.

Overall, these changes represent a significant step forward in the functionality and usability of the harpy tool.
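
For the second suggestion, a hedged sketch of creating the summary directory before writing. The helper name `write_summary` and the file layout are assumptions made for this example, not code from the PR.

```python
import os
import tempfile


def write_summary(path: str, text: str) -> None:
    """Write text to path, creating missing parent directories first."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes this a no-op when the directory already exists.
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as handle:
        handle.write(text)


# Demonstration in a temporary location; the nested folder does not exist yet.
outdir = tempfile.mkdtemp()
summary_path = os.path.join(outdir, "workflow", "example.summary")  # placeholder name
write_summary(summary_path, "example summary")
```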

harpy/snakefiles/snp_mpileup.smk (9)

16-19: Improved path handling and config key naming.

Great improvements:

  1. Using os.path.join() for envdir construction enhances cross-platform compatibility.
  2. Changing config["regiontype"] to config["region_type"] aligns with common naming conventions (snake_case).

These changes improve code quality and maintainability.


106-108: Enhanced input handling and output management.

Excellent updates:

  1. Using collect() for the bai input reads more clearly in Snakemake: it is an alias of expand(), so it gathers the same files while making the aggregation intent explicit.
  2. Marking the output as temp() helps manage disk space by allowing Snakemake to remove this intermediate file after it's no longer needed.

These changes contribute to a more efficient and resource-friendly workflow.


117-118: Improved input specification and consistency.

Good updates:

  1. Adding bam = bamlist as an explicit input enhances clarity and ensures proper dependency tracking.
  2. Using collect() for bai input maintains consistency with the previous rule and leverages Snakemake's efficient input aggregation.

These changes improve the workflow's reliability and readability.


255-258: Improved summary generation logic.

Excellent refactoring:

  1. Using a list for summary instead of a string allows for easier manipulation and formatting of the summary content.
  2. This change effectively resolves the issue mentioned in a previous review comment about the undefined summary_template.

The new approach is more flexible and less error-prone.


260-263: Enhanced summary content and improved readability.

Good improvements:

  1. Additional information in the summary provides more context about the workflow execution.
  2. Using a separate mpileup variable for storing parameters improves code readability and maintainability.

These changes contribute to a more informative and well-structured summary.


264-267: Consistent improvement in summary generation.

Well done:

  1. Introduction of bcfcall variable for bcftools call parameters maintains consistency with the previous improvements.
  2. Clear presentation of bcftools call step information enhances the summary's comprehensiveness.

This change contributes to a more detailed and structured workflow summary.


268-273: Comprehensive workflow documentation.

Excellent additions:

  1. New variables merged and normalize provide detailed information about variant merging and normalization steps.
  2. Consistent structure with previous changes enhances overall readability and maintainability.

These additions significantly improve the comprehensiveness of the workflow summary, making it easier for users to understand the entire process.


274-276: Enhanced reproducibility with workflow invocation details.

Great addition:

  1. The new sm variable captures the Snakemake workflow command, providing crucial information for reproducibility.
  2. Consistent with previous improvements, this addition enhances the overall quality of the summary.

Including the workflow invocation command in the summary is valuable for users who want to reproduce or understand the exact conditions under which the workflow was run.


278-278: Resolved summary writing issue.

Excellent fix:

  1. Using "\n\n".join(summary) efficiently creates the final summary string from the list of summary items.
  2. This change effectively resolves the issue mentioned in a previous review comment about writing an undefined summary_template.

The new approach is more robust and less error-prone, ensuring that the summary is correctly written to the file.
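
A minimal sketch of the list-based summary pattern described in these comments. The section texts are placeholders, not the wording or commands used in the actual Snakefile.

```python
# Each section of the summary is one list element.
summary = ["The example workflow ran using these parameters:"]  # placeholder text

# Build a section as its own string, then append it.
mpileup = "Genotype likelihoods were computed with:\n"
mpileup += "\t<command recorded by the workflow>"  # placeholder, not a real command
summary.append(mpileup)

sm = "The Snakemake workflow was called via command line:\n"
sm += "\t<workflow call recorded in the config>"  # placeholder
summary.append(sm)

# Joining with a blank line between sections gives the final text.
text = "\n\n".join(summary)
print(text)
```

Because every section is appended to the same list, there is no separate template variable that could be left undefined when the file is written.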

harpy/snakefiles/align_strobealign.smk (7)

17-17: Improved cross-platform compatibility for path handling.

The use of os.path.join() for constructing the envdir path enhances cross-platform compatibility. This is a good practice for handling file paths in Python.


89-89: Improved parameter naming for clarity.

Renaming quality to alignment_quality enhances the clarity of the parameter's purpose. This change is consistent with similar updates in other files, promoting uniformity across the codebase.


247-247: Consistent parameter naming in workflow summary.

The renaming of quality to alignment_quality in the workflow summary is consistent with the earlier change in the align rule. This maintains clarity and consistency throughout the file.


252-254: Improved summary generation and documentation.

The restructuring of the summary generation to use a list improves code readability and maintainability. The addition of new summary lines, including information about the genome file and alignment process, enhances the documentation of the workflow.


256-262: Enhanced alignment command documentation.

The updates to the alignment command construction provide a more detailed and accurate representation of the alignment process in the summary. The addition of conditional logic ensures that the correct commands are documented based on whether automatic or manual read length is used. This improves the overall quality of the workflow documentation.


264-269: Comprehensive documentation of duplicate marking process.

The addition of a detailed breakdown of the duplicate marking process significantly enhances the workflow documentation. This step-by-step explanation improves transparency and reproducibility of the workflow, allowing users to better understand the post-alignment processing steps.


270-274: Improved workflow reproducibility and summary formatting.

The addition of the Snakemake workflow command to the summary enhances reproducibility by documenting how the workflow was called. The use of "\n\n".join(summary) when writing the summary to a file improves the readability of the generated summary by adding appropriate spacing between sections.

harpy/snakefiles/metassembly.smk (6)

132-134: LGTM! Consistent updates to file paths

The updates to the input and output file paths in the bwa_index rule are consistent with the new directory structure introduced earlier. The use of multiext for defining multiple output files with the same prefix is a good practice.
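
For readers unfamiliar with multiext, the following pure-Python function is a rough stand-in for what it produces: one path per extension, all sharing a prefix. The function name `multiext_sketch` and the prefix path are hypothetical; inside a Snakefile the built-in helper would be used instead.

```python
def multiext_sketch(prefix: str, *exts: str) -> list[str]:
    """Return prefix + ext for every extension, mimicking Snakemake's multiext."""
    return [prefix + ext for ext in exts]


# Extensions written by a BWA index; the prefix is a placeholder path.
index_files = multiext_sketch(
    "assembly/contigs.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa"
)
print(index_files)
```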


144-148: LGTM! Consistent updates to file paths

The updates to the input and output file paths in the bwa_align rule are consistent with the new directory structure and the changes made in the bwa_index rule. The use of collect for handling multiple input fastq files is a good practice.


183-186: LGTM! Consistent updates to input file paths

The updates to the input file paths in the athena_config rule are consistent with the new directory structure introduced earlier. These changes ensure that the Athena configuration uses the correct files from the updated assembly process.


209-211: LGTM! Consistent updates to input file paths

The updates to the input file paths in the athena rule are consistent with the new directory structure and the changes made in previous rules. These updates ensure that Athena uses the correct input files from the updated assembly process.


Line range hint 1-267: Overall LGTM with suggestions for improvement

The changes to this Snakemake workflow file significantly enhance its flexibility and functionality, particularly in supporting both CloudSpades and MetaSPAdes assembly methods. The addition of new configuration variables, rules, and updates to file paths are well-implemented and consistent throughout the file. The improved workflow summary provides valuable detailed information about the assembly process.

Main points for improvement:

  1. Consider using the containerized variable consistently across all rules that have a container parameter.
  2. Add a comment in the cloudspades_assembly rule explaining the use of original FASTQ files instead of error-corrected ones.
  3. Ensure the new summary file name "metassembly.summary" is used consistently throughout the project.
  4. Consider using f-strings consistently for string formatting in the workflow_summary rule.

These improvements will further enhance the clarity and consistency of the workflow.


266-267: LGTM! Verify consistency of new summary file name

The change of the summary file name from "workflow_summary.txt" to "metassembly.summary" is noted. While this change is appropriate given the context of the workflow, it's important to ensure that this new file name is used consistently throughout the project.

Please run the following script to check for any references to the old file name:

This will help identify any places where the old file name might still be referenced and need updating.
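
A check of this kind could be sketched in Python as follows. The function name `find_references`, the file suffixes, and the search root are assumptions made for this example.

```python
import os


def find_references(root: str, needle: str, suffixes=(".py", ".smk")) -> list[tuple[str, int]]:
    """Return (path, line number) for every line under root containing needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Only look at source files; str.endswith accepts a tuple of suffixes.
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if needle in line:
                        hits.append((path, lineno))
    return hits
```

Calling `find_references("harpy", "workflow_summary.txt")` from the repository root would list any remaining mentions of the old file name, assuming the sources live under a `harpy` directory.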

harpy/snakefiles/impute.smk (10)

12-14: LGTM: New wildcard constraints enhance workflow flexibility

The addition of wildcard constraints for paramset and contig improves the robustness of the workflow. These constraints prevent potential issues with invalid characters in wildcards and directory traversal.
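
Wildcard constraints are regular expressions, so their effect can be illustrated with the `re` module. The pattern `[^/]+` below is a hypothetical constraint chosen for the example; the actual patterns in the Snakefile are not shown in this review, and Snakemake applies them as part of matching the whole file path rather than through this helper.

```python
import re

# Hypothetical constraint: one or more characters, none of them a slash.
constraint = re.compile(r"[^/]+")


def satisfies(value: str) -> bool:
    """True when the whole value matches the constraint."""
    return constraint.fullmatch(value) is not None


for candidate in ("paramset1", "a/b", "../paramset1"):
    print(candidate, satisfies(candidate))
```

A value containing a path separator fails the match, which is how such a constraint keeps a wildcard from spanning directories.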


16-26: LGTM: Improved variable definitions and file handling

The changes in this section enhance the code quality:

  1. Consistent use of the config dictionary for variable definitions.
  2. Introduction of stitch_params to support the new paramset functionality.
  3. Use of a with statement for reading the biallelic file, which ensures proper resource management.

These modifications improve code readability and resource handling.
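
A small sketch of the `with` pattern from point 3, assuming (as a guess, since the file format is not shown here) that the file lists one contig name per line. The file is created in a temporary directory so the example is self-contained.

```python
import os
import tempfile

# Create a placeholder input file; names and contents are illustrative only.
workdir = tempfile.mkdtemp()
biallelic_path = os.path.join(workdir, "biallelic.example")
with open(biallelic_path, "w") as handle:
    handle.write("contig1\ncontig2\n\n")

# The with statement closes the file even if reading raises an exception.
with open(biallelic_path) as handle:
    contigs = [line.strip() for line in handle if line.strip()]

print(contigs)
```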


54-54: LGTM: Improved dependency management in alignment_list rule

The addition of bailist as an input ensures that BAM index files are created before this rule runs. This change enhances the workflow's dependency management and prevents potential issues with missing index files.


66-73: LGTM: stitch_conversion rule aligned with contig-based workflow

The changes in this rule improve consistency with the new contig-based workflow structure:

  1. Output file naming updated from {part}.stitch to {contig}.stitch.
  2. Shell command now uses {wildcards.contig} for region selection.

These modifications ensure that the rule correctly processes data on a per-contig basis.


80-98: LGTM: Comprehensive updates to impute rule enhance flexibility and dependency management

The changes to the impute rule significantly improve its functionality and integration with the new workflow structure:

  1. Input and output paths now use {contig} and {paramset}, aligning with the new workflow organization.
  2. Addition of bam and bailist inputs improves dependency management.
  3. New parameters referencing stitch_params allow for flexible configuration of STITCH parameters per paramset.

These modifications enhance the rule's flexibility and ensure proper execution within the updated workflow structure.
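
One way a per-paramset lookup like stitch_params could be built without pandas, and without relying on column order, is sketched below. The function name `read_params`, the column names, and the values are hypothetical; the real parameter file may differ.

```python
import io


def read_params(handle) -> dict[str, dict[str, str]]:
    """Parse a whitespace-delimited table into {name: {column: value}}."""
    header = handle.readline().split()
    params = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Pair values with header names, so column order does not matter.
        row = dict(zip(header, fields))
        name = row.pop("name")
        params[name] = row
    return params


# Two hypothetical parameter sets with placeholder columns.
example = io.StringIO("name model k\nset1 modelA 10\nset2 modelB 5\n")
stitch_params = read_params(example)
print(stitch_params["set1"]["k"])
```

A rule parameter could then look up `stitch_params[wildcards.paramset]["k"]`, assuming the wildcard value matches a row name.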


101-102: LGTM: Flexible conda environment specification

The use of an f-string with envdir for specifying the conda environment improves flexibility and consistency with the envdir variable defined earlier in the file. This change allows for easier management of environment locations across different setups.


Line range hint 108-137: LGTM: Enhanced indexing and reporting with updated index_vcf and new contig_report rules

These changes significantly improve the workflow's functionality:

  1. rule index_vcf now uses {paramset} and {contig}, aligning with the new workflow structure.
  2. The new rule contig_report adds valuable per-contig reporting capabilities.
  3. Parameters in contig_report are consistent with rule impute, ensuring coherent configuration across the workflow.

These modifications enhance the workflow's organization and provide more detailed output for analysis.


145-157: LGTM: concat_list and merge_vcf rules updated for paramset-based workflow

The changes to concat_list and merge_vcf rules successfully integrate them into the paramset-based workflow structure:

  1. Both rules now use {paramset} in their input and output paths, ensuring consistency with the new workflow organization.
  2. The collect statement in concat_list correctly uses double braces {{paramset}} to escape the wildcard, so collect leaves it unexpanded while filling in the other placeholders.

These modifications ensure proper handling of multiple parameter sets in the workflow.
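
The double-brace escaping follows the same convention as Python's `str.format`, which makes it easy to illustrate outside Snakemake. The path pattern and contig names below are hypothetical.

```python
# Double braces survive formatting as single braces; single braces are filled.
pattern = "{{paramset}}/contigs/{contig}.bcf"  # placeholder path

contigs = ["contig1", "contig2"]
paths = [pattern.format(contig=c) for c in contigs]
print(paths)
```

Each resulting path still contains a literal `{paramset}`, which is what lets the rule resolve that wildcard later for each parameter set.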


167-194: LGTM: Consistent updates to post-processing rules for paramset compatibility

The changes to index_merged, general_stats, and compare_stats rules demonstrate consistent integration with the paramset-based workflow:

  1. All rules now use {paramset} in their input and output paths.
  2. The changes are applied consistently across all three rules.

This consistency improves the workflow's maintainability and ensures proper handling of multiple parameter sets throughout the post-processing stages.


205-259: LGTM: Improved reporting and comprehensive workflow summary

The changes to impute_reports and workflow_summary rules significantly enhance the workflow's output and documentation:

  1. impute_reports now uses {paramset} in its paths, aligning with the paramset-based structure.
  2. workflow_summary has been refactored to use a list-based approach, improving readability and maintainability.
  3. The new summary format provides comprehensive information about workflow parameters, preprocessing steps, and execution details.

These improvements enhance the workflow's usability by providing more detailed and structured output, which is valuable for reproducibility and troubleshooting.

harpy/snakefiles/align_ema.smk (5)

27-27: Improved cross-platform compatibility

The use of os.path.join() for constructing the envdir path enhances cross-platform compatibility. This is a good practice that ensures the correct path separator is used regardless of the operating system.


176-176: Consistent parameter naming

The change to use config["alignment_quality"] in the align_bwa rule is consistent with the previous change in the align_ema rule. This maintains consistency across different alignment methods.


338-342: Improved summary generation

The refactoring of the summary generation to use a list of strings and provide more detailed descriptions for each step is a significant improvement. This change enhances readability and maintainability of the code. The use of f-strings for command construction is also a good practice that improves code clarity.


367-369: Useful addition of workflow call information

The inclusion of the Snakemake workflow call command in the summary is a valuable addition. It provides important context on how the workflow was executed, which can be helpful for reproducibility and debugging.


150-150: Standardized parameter naming

The change from config["quality"] to config["alignment_quality"] appears to be part of a broader effort to standardize parameter naming. This is a good practice for maintaining consistency across the project.

To ensure this change is consistent across the codebase, please run the following script:

✅ Verification successful

Parameter naming standardized successfully

All instances of config["quality"] have been replaced with config["alignment_quality"] across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining instances of config["quality"] and config["alignment_quality"]
echo "Searching for config[\"quality\"]:"
rg 'config\["quality"\]'
echo "Searching for config[\"alignment_quality\"]:"
rg 'config\["alignment_quality"\]'

Length of output: 973

Repository owner deleted a comment from coderabbitai Bot Oct 22, 2024
@pdimens pdimens merged commit f14997e into main Oct 22, 2024
@pdimens pdimens deleted the cloudspades branch October 22, 2024 13:10
This was referenced Feb 20, 2025
@coderabbitai coderabbitai Bot mentioned this pull request Oct 8, 2025
@coderabbitai coderabbitai Bot mentioned this pull request Apr 13, 2026
Development

Successfully merging this pull request may close these issues.

Add assembler via cloudspades

1 participant