Storage-MongoDB 2 steps loading #288

Closed
17 tasks done
j-coll opened this issue Apr 11, 2016 · 1 comment

j-coll commented Apr 11, 2016

  • Define the stage collection schema
    • Document the stage collection schema in the wiki
  • Implement loaders and readers from the stage collection
    • Avoid false DupKey exceptions with concurrent upserts. See MongoDB [SERVER-14322]
    • Read sorted. Split into non-overlapping batches.
  • Implement new Variant loader reading from the stage collection
    • Detect overlapping variants
    • Fetch loaded variants for overlapping variants
    • Handle horizontally split files (e.g. divided per chromosome)
    • Add specific tests scanning MongoDB to check the documents.
    • Update VariantMongoDBWriterTest
  • Add direct loader for the first file in a chromosome to load directly to the variants collection. See #354
  • Implement batch load method using the new functionality: VariantStorageManager#index(List<URI>)
  • Check resume mode
    • This requires recording the ongoing staging and merging operations. Use BatchFileOperation
    • Resume mode for Stage loading step
    • Resume mode for Merge step
  • Add batch load tests
  • Concurrent merge from different studies. See "concurrentMerge"
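
As an illustration of the task of reading the stage collection sorted and splitting it into batches that do not overlap, here is a minimal sketch. It is not the OpenCGA implementation: the `Staged` class, the `split` method and the `targetSize` parameter are hypothetical, and it assumes every variant in a chromosome can be reduced to a closed interval `[start, end]` that arrives sorted by `start`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are illustrative, not taken from OpenCGA.
class NonOverlappingBatches {

    /** Assumed shape of a staged variant: a closed interval on one chromosome. */
    static final class Staged {
        final int start;
        final int end;

        Staged(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Splits variants, already sorted by start, into batches of roughly targetSize.
     * A batch is closed only when the next variant starts after every end seen so far,
     * so a variant never overlaps a variant that sits in a different batch.
     */
    static List<List<Staged>> split(List<Staged> sorted, int targetSize) {
        List<List<Staged>> batches = new ArrayList<>();
        List<Staged> current = new ArrayList<>();
        int maxEnd = Integer.MIN_VALUE;
        for (Staged v : sorted) {
            boolean overlapsEarlier = v.start <= maxEnd;
            if (current.size() >= targetSize && !overlapsEarlier) {
                batches.add(current);
                current = new ArrayList<>();
            }
            current.add(v);
            maxEnd = Math.max(maxEnd, v.end);
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

With this rule a batch can grow beyond `targetSize` when a long variant spans many others, which is the price of keeping each group of overlapping variants inside a single batch.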
j-coll self-assigned this Apr 11, 2016
j-coll added this to the v0.8.0 milestone Apr 11, 2016
j-coll added a commit to j-coll/opencga that referenced this issue Apr 15, 2016
j-coll added a commit to j-coll/opencga that referenced this issue Apr 15, 2016
Duplicated variants in Stage are not inserted in variants, but then set
as not new variants for the study.
j-coll added a commit that referenced this issue Apr 18, 2016
j-coll added a commit that referenced this issue Apr 19, 2016
j-coll added a commit that referenced this issue Apr 28, 2016
j-coll added a commit that referenced this issue Apr 29, 2016
j-coll added a commit that referenced this issue Apr 29, 2016
j-coll added a commit that referenced this issue Jun 28, 2016
Use query ID = chr:pos:ref:alt instead of REGION, REFERENCE, ALTERNATE

The previous version can return wrong results if in that region there is
a variant with the same REF and ALT.
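
To illustrate the problem this commit message describes, the sketch below compares the two lookups over an in-memory list. It is an assumption-laden model, not the OpenCGA query code: the `Variant` class and the `byRegion` and `byId` helpers are hypothetical, and only the `chr:pos:ref:alt` ID format comes from the commit message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model; the real queries run against MongoDB.
class VariantIdQuerySketch {

    static final class Variant {
        final String chr;
        final int pos;
        final String ref;
        final String alt;

        Variant(String chr, int pos, String ref, String alt) {
            this.chr = chr;
            this.pos = pos;
            this.ref = ref;
            this.alt = alt;
        }

        /** ID in the chr:pos:ref:alt format named in the commit message. */
        String id() {
            return chr + ":" + pos + ":" + ref + ":" + alt;
        }
    }

    /** Previous style: filter by region plus REF and ALT; may match several variants. */
    static List<Variant> byRegion(List<Variant> all, String chr, int start, int end,
                                  String ref, String alt) {
        List<Variant> hits = new ArrayList<>();
        for (Variant v : all) {
            if (v.chr.equals(chr) && v.pos >= start && v.pos <= end
                    && v.ref.equals(ref) && v.alt.equals(alt)) {
                hits.add(v);
            }
        }
        return hits;
    }

    /** New style: filter by the full ID, which pins down the position as well. */
    static List<Variant> byId(List<Variant> all, String id) {
        List<Variant> hits = new ArrayList<>();
        for (Variant v : all) {
            if (v.id().equals(id)) {
                hits.add(v);
            }
        }
        return hits;
    }
}
```

If the list holds both `1:100:A:T` and `1:103:A:T`, a region filter over `1:100-105` with REF `A` and ALT `T` matches both, while the ID lookup for `1:100:A:T` matches only the intended variant.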
j-coll added a commit that referenced this issue Jun 29, 2016
j-coll added a commit that referenced this issue Jun 30, 2016
Avoid leaving documents in the stage collection

setStatus RUNNING
merge
setStatus DONE
clean stage
update indexed files + setStatus READY
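
The ordering above can be sketched as a small function that returns the steps it would run. This is only a guess at how the recorded status might drive a resume: the `Status` enum and the `run` method are hypothetical, and the resume rules (repeat the merge unless the status is already DONE, do nothing when it is READY) are an assumption that the issue does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the step ordering; not the BatchFileOperation API.
class MergeLifecycleSketch {

    enum Status { NONE, RUNNING, DONE, READY }

    /**
     * Returns the steps to execute given the status recorded by a previous attempt.
     * Assumption: a merge interrupted while RUNNING is repeated, a merge that reached
     * DONE only needs the stage cleanup, and a READY operation needs nothing.
     */
    static List<String> run(Status recorded) {
        List<String> steps = new ArrayList<>();
        if (recorded == Status.READY) {
            return steps;
        }
        if (recorded != Status.DONE) {
            steps.add("setStatus RUNNING");
            steps.add("merge");
            steps.add("setStatus DONE");
        }
        // Cleaning only after DONE means a crash during the merge leaves the stage intact.
        steps.add("clean stage");
        steps.add("update indexed files + setStatus READY");
        return steps;
    }
}
```

Marking DONE before cleaning the stage collection is what makes it possible to tell a half-merged file apart from one whose only remaining work is the cleanup.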

j-coll commented Jul 4, 2016

When updating the variants collection, there are three possible scenarios:

  • The document does not exist (newVariant)
  • The document exists but the study doesn't (newStudy)
  • Both the document and the study exist

The first approach consists of issuing a mongodb.insert for each newVariant, and updates for the other two scenarios.

When concurrent merges from different studies are allowed, a document that is supposedly missing from the collection (newVariant) may be inserted by several studies at the same time, causing a DupKey exception.

The easiest way to sort this out is to perform a mongodb.update with upsert enabled and the $setOnInsert operator.

This second approach uses only update operations, and handles the first and second scenarios with the same operation.
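
The difference between the two approaches can be modelled without a database. The sketch below is an in-memory stand-in, not the MongoDB driver and not the OpenCGA loader: `VariantDoc`, `insert` and `upsert` are hypothetical names, and a `ConcurrentHashMap` plays the role of the variants collection with its unique `_id` index. A real server-side upsert may still need a retry on a duplicate key error, as the SERVER-14322 task in the first comment suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the variants collection; all names are hypothetical.
class SetOnInsertSketch {

    /** Assumed document shape: fields written once, plus one entry per study. */
    static final class VariantDoc {
        final String chromosome;                        // plays the $setOnInsert part
        final List<String> studies = new ArrayList<>(); // appended on every update

        VariantDoc(String chromosome) {
            this.chromosome = chromosome;
        }
    }

    static final Map<String, VariantDoc> variants = new ConcurrentHashMap<>();

    /** First approach: plain insert, which fails if another study got there first. */
    static void insert(String id, String chromosome, String study) {
        VariantDoc doc = new VariantDoc(chromosome);
        doc.studies.add(study);
        if (variants.putIfAbsent(id, doc) != null) {
            throw new IllegalStateException("DupKey: " + id);
        }
    }

    /**
     * Second approach: one upsert covers newVariant and newStudy.
     * Roughly the shape of an update such as
     *   { $setOnInsert: { chromosome: ... }, $push: { studies: ... } } with upsert: true
     * where the field names are illustrative only.
     */
    static void upsert(String id, String chromosome, String study) {
        variants.compute(id, (key, existing) -> {
            // Variant-level fields are written only when the document is created.
            VariantDoc doc = existing != null ? existing : new VariantDoc(chromosome);
            doc.studies.add(study);
            return doc;
        });
    }

    public static void main(String[] args) {
        upsert("1:100:A:T", "1", "studyA"); // newVariant: document is created
        upsert("1:100:A:T", "1", "studyB"); // newStudy: same call, no DupKey
        System.out.println(variants.get("1:100:A:T").studies);
    }
}
```

The third scenario, where both the document and the study already exist, is not covered by this sketch and would still need its own update.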
