Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export/Download document support #415

Merged
merged 5 commits into from
May 1, 2024
Merged

Conversation

coryisakson
Copy link
Contributor

@coryisakson coryisakson commented Apr 18, 2024

Motivation and Context (Why the change? What's the scenario?)

Validating AI answers requires access to the source grounding documents and data. The KM solution enables easy ingestion of grounding documents as well as the ability to remove documents. A file download feature allows consumers access to the grounding source materials and allow them to verify the answers presented by the ASK endpoint.

High level description (Approach, Design)

  • New methods to memory and storage interface, to allow access to individual files
  • New web service endpoint to download files, passing index name, document ID and file name
  • Include download link in RAG citations and search results:
image * Stream file download and support Range fetch

Since KM is a backend service not meant for multi-user direct access (ie KM security model is based on a single key, like a SQL server or any DB), the endpoint provides direct access to all files, similarly to search allows access to all memory records. For public deployments, a middleware webservice should take care of securing links, e.g. adding and validating signatures and user tokens.

@coryisakson coryisakson requested a review from dluc as a code owner April 18, 2024 23:11
@coryisakson
Copy link
Contributor Author

@microsoft-github-policy-service agree [company="Microsoft"]

@coryisakson
Copy link
Contributor Author

@microsoft-github-policy-service agree company="Microsoft"

@dluc
Copy link
Collaborator

dluc commented Apr 22, 2024

@coryisakson could you merge the latest changes from main and check che build? I can't build the code, with a bunch or warnings and errors. Thanks

@dluc dluc added the waiting for author Waiting for author to reply or address comments label Apr 22, 2024
@coryisakson
Copy link
Contributor Author

@dluc the branch is updated and unit tests are passing.

@dluc
Copy link
Collaborator

dluc commented Apr 24, 2024

I tried fixing the conflicts and reviewing but the PR it too big. I think it would really help excluding changes that are unrelated to the new feature, e.g. spacing, string changes (like the mime type in qdrant). E.g. rather than 52 files maybe bring the PR down to 30 files or so. Given the big number of changes to interfaces and new classes it's going to take some time.
By the way I tried to rebuild the branch so that it can be more easily rebased when main changes, but even that was taking too long. If you have a chance to squash all the changes into a single commit, that would help managing the PR.

See also other comment, I would split the changes to IContentStorage out, so we can review the "write" changes first and make this PR easier to manage

.vscode/launch.json Outdated Show resolved Hide resolved
.vscode/settings.json Outdated Show resolved Hide resolved
extensions/Qdrant/Qdrant/Client/Http/HttpRequest.cs Outdated Show resolved Hide resolved
nuget.config Outdated Show resolved Hide resolved
service/tests/TestHelpers/TestHelpers.csproj Outdated Show resolved Hide resolved
service/Abstractions/ContentStorage/IContentStorage.cs Outdated Show resolved Hide resolved
service/Abstractions/ContentStorage/IContentStorage.cs Outdated Show resolved Hide resolved
service/Abstractions/ContentStorage/IContentFile.cs Outdated Show resolved Hide resolved
service/Abstractions/ContentStorage/IContentFile.cs Outdated Show resolved Hide resolved
@dluc
Copy link
Collaborator

dluc commented Apr 24, 2024

A quick thought about the changes to IContentStorage: we can reuse the current mime detection to know the mime type, without the need of storing it. I understand that storing it would be ideal, but we can do that separately and later. This should allow to make the PR much smaller. Thoughts?

nuget.config Outdated Show resolved Hide resolved
@dluc
Copy link
Collaborator

dluc commented Apr 26, 2024

I rebuilt the branch fixing some merge gone wrong, and making a few minor changes to namespaces and names. I see the approach taken introduces a new dependency with the responsibility of checking access and downloading files, which I'm not sure about, in terms of design. I'll try playing with some changes, reorganizing these responsibilities and how memory/storage/orchestrator work together to provide the same functionality. My preference would be about leveraging the orchestrator, not nesting content access into the validation service (which should just validate). I haven't looked at the download part yet, e.g IContent interface, which might actually be more important given it affects the primary API.


namespace Microsoft.KernelMemory;

public sealed class StreamableFileContent : IDisposable
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if we can reuse .NET FileInfo class and delete this one

dluc added a commit that referenced this pull request May 1, 2024
## Motivation and Context (Why the change? What's the scenario?)

Supporting abstractions for new File Download feature. See also PR #415

## High level description (Approach, Design)

* New version 0.40
* Breaking changes on storage interface
* New methods on orchestration interface
* New methods on memory interface
@dluc dluc merged commit 48a33d4 into microsoft:main May 1, 2024
2 checks passed
@dluc dluc removed the waiting for author Waiting for author to reply or address comments label May 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants