Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace stampy with minimap2 #27

Merged
merged 4 commits into from
Feb 10, 2021
Merged

Conversation

bricoletc
Copy link
Collaborator

@iqbal-lab could you give me access to the cortex repo? so I could in future push code to a minimap2 branch for eg.
In the meantime this is for reviewing my changes.

Code

  • Remove stampy command line args in run_calls.pl and
    process_calls.pl, replacing with a single minimap2_bin argument
  • instead of stampy index, use ref fasta (passed as --ref_fasta)
    to place variants in process_calls.pl (minimap2 can index ref on the
    fly)
    Previously, --ref_fasta was used to check for misplaced variants. run_calls.pl always uses the option. Now, this always occurs, as --ref_fasta is used to place variants.

Checks

I have validated that I get the same VCF when running py_cortex_api (it wraps cortex's independent workflow) with stampy and with minimap2, on cortex's demo files example1

TODOs

  • A systematic check on several datasets- notably, I observed minimap2 MAPQ seems lower than stampy for 50bp reads
  • Update cortex manual
  • Maybe this code can be removed, as minimap2 could probably map >1000bp sequence even in short-read preset? :
    ## if the 5p flank is >1000bp, then stampy fails, so we cut the 5p flank and only take the last 1000bp.
    if ( exists $href_var_name_to_cut_flank->{$name} )
    {
    ##the flank5p passedin is directly taken from the callfile
    $flank5p = substr( $flank5p, -1000 );
    if ( length($flank5p) != 1000 )
    {
    die("perl issue with $flank5p");
    }
    }

* Remove stampy command line args in run_calls.pl and
process_calls.pl, replacing with a single minimap2_bin argument
* instead of stampy index, use ref fasta (passed as --ref_fasta)
  to place variants in process_calls.pl (minimap2 can index ref on the
fly)
  And check for misplaced variants using fasta ref by default
* Replace stampy mentions in manual and cheatsheets
* Replace stampy in other scripts/calling
* Remove backup data files
* Use 'thinner' minimisers: smaller k and w, to
  accommodate small flanks being mapped
* Use lower mapq threshold, as shorter flanks get lower
  mapq in minimap2
* Allow for secondary/supplementary mappings in sam
Remove mentions of mapq in `process_calls.pl` command-line help;
as i)mapq threshold is given in output VCF and ii)mapq threshold is now
1, so it is inappropriate to say mapping is confident.
@bricoletc
Copy link
Collaborator Author

Logging here a validation evaluation of using minimap2. Evaluation uses varifier on one staph dataset from martin (with truth genome assembled using pilon), :

I found that the 'good' minimap2 settings (-k9 -w4) give following recall/precision:
  * require mapq > 40: "Precision_edit_dist": 0.99996956, "Recall_edit_dist": 0.50726812
  * require mapq > 0:  "Precision_edit_dist": 0.99998191, "Recall_edit_dist": 0.83881474,
  * previous cortex, using stampy and mapq > 40, gets "Precision_edit_dist": 0.98863584, "Recall_edit_dist": 0.83029847,
on this dataset, minimap2 with mapq>0 does a little better in both metrics.

@bricoletc bricoletc merged commit f6ff403 into iqbal-lab:master Feb 10, 2021
@bricoletc
Copy link
Collaborator Author

Logging some more validation work using 14 ilmn Plasmodium falciparum samples with matches pacb assemblies (in previous comment, ran on a Staphylococcus aureus dataset):

stampy: recall: 0.4490 precision: 0.9118
minimap2: recall: 0.4443 precision: 0.9258

These are the average of metrics computed using varifier. THis confirms the change works fine.

@iqbal-lab
Copy link
Owner

hurrah

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants