Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(proto!): make document a recursive structure #652

Merged
merged 53 commits into from Jul 22, 2020
Merged

Conversation

hanxiao
Copy link
Member

@hanxiao hanxiao commented Jul 13, 2020

Estimated Done 25th July
This will be part of v0.5.0

Current test progress

  • hello-world can be run without any problem
  • unit test is missing for all new QL drivers, will do in another PR

Intro

This will be a super big PR with breaking changes. Simply put, it removes the protobuf spec of "Chunk". Now, a "Chunk" is defined as a "Document" that belongs to another "Document". Hence, "Document" is now a recursive structure. One can visit chunks via doc.chunks and get children List[Document]

This is an upgrade over the previous "Document/Chunk" concept. It makes the whole thing more general, enables advance segmenting and matching (#648)

Affects

  • (Thanks to Jina's design btw) this should mostly affect drivers, not executors.
  • drivers that previously work with doc/chunk will be refactored by recursively working on the doc structure.
  • all previous chunk-related drivers/executors are removed because the interface is now unified.
  • Executors do not need to know protobuf exists. They are "document/chunk-agnostic". All structure abstraction is taken care by Drivers.

Breakdown

  • Change protobuf, delete the definition of Chunk.
  • Add a BaseRecursiveDriver that works on single doc recursively in preorder and postoder.
  • Refactoring drivers using the newly added BaseRecursiveDriver
    • craft.py
    • convert.py
    • prune.py
    • control.py: no need to modify
    • encode.py
    • index.py
    • score.py
    • search.py
    • reduce.py
  • all YAML resources
  • all unit tests (will cover in other PRs)
  • all examples, freeze the jina version this time!

Migration guide (draft)

  • level is not required in all drivers anymore.

  • PruneDriver is now ExcludeQL, no level is required, excludes -> fields, e.g.

    - !ExcludeQL
      with:
        fields:
          - embedding
          - buffer
          - blob
          - text
  • DocPruneDriver is removed, one can now use
    - !ExcludeQL
      with:
        fields:
          - buffer
          - chunks
  • ChunkPruneDriver is removed, one can now use the following, as one can see it is basically the same as DocPruneDriver , just with different fields.
    - !ExcludeQL
      with:
        fields:
          - embedding
          - buffer
          - blob
          - text
  • BaseCraftDriver is now CraftDriver. DocCraftDriver and ChunkCraftDriver are removed
  • doc_id and chunk_id are unified as id now
  • UnarySegmentDriver is removed. Now EncodeDriver works directly on doc-level embedding, extract_chunks is now refactored as extract_docs works on root-level extraction (previously it works on doc.chunks level extraction)
  • BaseChunkCrafter and BaseDocCrafter are removed.
  • TopkFilterDriver is removed and now is implemented with Query Language:
!SliceQL
with:
   start: 0
   end: k
  • ChunkPbIndex and DocPbIndexer are now removed, please use BinaryPbIndexer
  • ReqPruneDriver is removed now use
    - !ExcludeReqQL
      with:
        fields:
          - request
  • _merge_topk_docs -> _merge_all

@hanxiao hanxiao marked this pull request as draft July 13, 2020 15:08
@github-actions
Copy link

github-actions bot commented Jul 13, 2020

Jina CLA check ✅ All Contributors have signed the CLA.

@github-actions
Copy link

This PR closes: #648

@jina-bot jina-bot added size/L area/core This issue/PR affects the core codebase area/helper This issue/PR affects the helper functionality area/network This issue/PR affects network functionality area/testing This issue/PR affects testing component/client component/driver component/executor component/proto executor/indexer labels Jul 13, 2020
jina/drivers/filter.py Outdated Show resolved Hide resolved
tests/test_quant.py Outdated Show resolved Hide resolved
@jina-bot jina-bot added size/XL and removed size/L labels Jul 14, 2020
@JoanFM
Copy link
Member

JoanFM commented Jul 14, 2020

@hanxiao better do not activate integration-test as required until the PR #651 is merged, and integration test become a reality

@hanxiao hanxiao linked an issue Jul 19, 2020 that may be closed by this pull request
@JoanFM
Copy link
Member

JoanFM commented Jul 22, 2020

Will solve #669

@hanxiao hanxiao marked this pull request as ready for review July 22, 2020 09:02
@hanxiao hanxiao linked an issue Jul 22, 2020 that may be closed by this pull request
@hanxiao hanxiao merged commit bbb57df into master Jul 22, 2020
@hanxiao hanxiao deleted the feat-proto-648 branch July 22, 2020 09:12
@github-actions github-actions bot locked and limited conversation to collaborators Jul 22, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/core This issue/PR affects the core codebase area/docs This issue/PR affects the docs area/helloworld This issue/PR affects the helloworld area/helper This issue/PR affects the helper functionality area/network This issue/PR affects network functionality area/testing This issue/PR affects testing component/client component/driver component/executor component/proto component/resource executor/crafter executor/indexer size/XL
Projects
None yet
3 participants