Export/Download document support #415

coryisakson · 2024-04-18T23:11:22Z

Motivation and Context (Why the change? What's the scenario?)

Validating AI answers requires access to the source grounding documents and data. The KM solution enables easy ingestion of grounding documents as well as the ability to remove documents. A file download feature gives consumers access to the grounding source materials and allows them to verify the answers presented by the ASK endpoint.

High level description (Approach, Design)

* New methods to memory and storage interface, to allow access to individual files
* New web service endpoint to download files, passing index name, document ID and file name
* Include download link in RAG citations and search results
* Stream file download and support Range fetch
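
As a rough illustration of the Range support listed above, the sketch below resolves a single `Range` request header against a known file length. `ByteRangeSketch.Resolve` is a hypothetical helper written for this example, not code from this PR; it only assumes the standard `System.Net.Http.Headers.RangeHeaderValue` parser.

```csharp
using System;
using System.Linq;
using System.Net.Http.Headers;

public static class ByteRangeSketch
{
    // Hypothetical helper: maps a single "bytes=..." range onto a file of the given length.
    // Returns null when the header is missing, malformed, multi-range, or unsatisfiable,
    // in which case a caller would typically fall back to sending the whole file.
    public static (long Start, long End)? Resolve(string? rangeHeader, long fileLength)
    {
        if (fileLength <= 0) { return null; }
        if (!RangeHeaderValue.TryParse(rangeHeader, out RangeHeaderValue? parsed)) { return null; }
        if (!string.Equals(parsed.Unit, "bytes", StringComparison.OrdinalIgnoreCase)) { return null; }
        if (parsed.Ranges.Count != 1) { return null; }

        RangeItemHeaderValue item = parsed.Ranges.First();
        long start;
        long end;
        if (item.From is null)
        {
            // Suffix form "bytes=-N": the last N bytes of the file.
            long suffix = Math.Min(item.To!.Value, fileLength);
            if (suffix == 0) { return null; }
            start = fileLength - suffix;
            end = fileLength - 1;
        }
        else
        {
            // "bytes=A-B" or open-ended "bytes=A-": clamp the end to the last byte.
            start = item.From.Value;
            end = Math.Min(item.To ?? fileLength - 1, fileLength - 1);
        }

        if (start > end) { return null; }
        return (start, end);
    }
}
```

In an ASP.NET Core service this parsing is usually unnecessary, since file results such as `Results.File(...)` accept `enableRangeProcessing: true` and handle the header themselves; the helper only makes the arithmetic explicit.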

Since KM is a backend service not meant for multi-user direct access (i.e. the KM security model is based on a single key, like a SQL server or any DB), the endpoint provides direct access to all files, similarly to how search allows access to all memory records. For public deployments, a middleware web service should take care of securing links, e.g. adding and validating signatures and user tokens.
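
One way such a middleware could sign and validate links is an HMAC over the download parameters plus an expiry time. The sketch below is only an assumption about how that might look: `DownloadLinkSigner`, its parameters, and the newline-delimited payload layout are hypothetical and not part of KM or this PR.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public static class DownloadLinkSigner
{
    // Hypothetical: computes a hex HMAC-SHA256 signature over the link parameters.
    // The payload layout (newline-delimited fields) is an assumption for this sketch.
    public static string Sign(byte[] key, string index, string documentId, string fileName, long expiresUnixSeconds)
    {
        string expires = expiresUnixSeconds.ToString(CultureInfo.InvariantCulture);
        byte[] payload = Encoding.UTF8.GetBytes($"{index}\n{documentId}\n{fileName}\n{expires}");
        return Convert.ToHexString(HMACSHA256.HashData(key, payload));
    }

    // Rejects expired links, then compares signatures in constant time.
    public static bool Verify(byte[] key, string index, string documentId, string fileName,
        long expiresUnixSeconds, string signature, long nowUnixSeconds)
    {
        if (nowUnixSeconds > expiresUnixSeconds) { return false; }

        string expectedHex = Sign(key, index, documentId, fileName, expiresUnixSeconds);
        byte[] expected = Encoding.UTF8.GetBytes(expectedHex);
        byte[] actual = Encoding.UTF8.GetBytes(signature.ToUpperInvariant());
        return CryptographicOperations.FixedTimeEquals(expected, actual);
    }
}
```

The middleware would append the expiry and signature to the link it hands to a user, and check both before proxying the request to the KM download endpoint with the service key.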

coryisakson · 2024-04-18T23:15:13Z

@microsoft-github-policy-service agree [company="Microsoft"]

coryisakson · 2024-04-18T23:16:55Z

@microsoft-github-policy-service agree company="Microsoft"

dluc · 2024-04-22T18:16:41Z

@coryisakson could you merge the latest changes from main and check the build? I can't build the code, with a bunch of warnings and errors. Thanks

coryisakson · 2024-04-23T16:45:50Z

@dluc the branch is updated and unit tests are passing.

dluc · 2024-04-24T01:40:54Z

I tried fixing the conflicts and reviewing, but the PR is too big. I think it would really help to exclude changes that are unrelated to the new feature, e.g. spacing and string changes (like the mime type in qdrant). E.g. rather than 52 files, maybe bring the PR down to 30 files or so. Given the large number of changes to interfaces and new classes, it's going to take some time.
By the way, I tried to rebuild the branch so that it can be more easily rebased when main changes, but even that was taking too long. If you have a chance to squash all the changes into a single commit, that would help with managing the PR.

See also the other comment: I would split the changes to IContentStorage out, so we can review the "write" changes first and make this PR easier to manage.

.vscode/launch.json

.vscode/settings.json

extensions/Qdrant/Qdrant/Client/Http/HttpRequest.cs

nuget.config

service/Abstractions/Compositions/ExportValidationService.cs

service/tests/TestHelpers/TestHelpers.csproj

service/Abstractions/ContentStorage/IContentStorage.cs

service/Abstractions/ContentStorage/IContentFile.cs

dluc · 2024-04-24T18:20:12Z

A quick thought about the changes to IContentStorage: we can reuse the current mime detection to determine the mime type, without needing to store it. I understand that storing it would be ideal, but we can do that separately and later. This should make the PR much smaller. Thoughts?
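
To make the suggestion concrete, here is a minimal sketch of deriving the mime type from the file name at download time instead of persisting it. `MimeFromFileNameSketch` and its small extension table are hypothetical stand-ins; KM's existing detection would take their place.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class MimeFromFileNameSketch
{
    // Hypothetical, deliberately small table; the real detection covers more types.
    private static readonly Dictionary<string, string> s_types = new(StringComparer.OrdinalIgnoreCase)
    {
        [".txt"] = "text/plain",
        [".md"] = "text/markdown",
        [".pdf"] = "application/pdf",
        [".json"] = "application/json",
        [".docx"] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    };

    // Looks up the type by extension, falling back to a generic binary type.
    public static string Detect(string fileName)
    {
        string extension = Path.GetExtension(fileName) ?? string.Empty;
        return s_types.TryGetValue(extension, out string? type) ? type : "application/octet-stream";
    }
}
```

With this approach the storage interface only needs to return the stream and file name, and the download endpoint can set the content type itself.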

nuget.config

clients/dotnet/SemanticKernelPlugin/MemoryPlugin.cs

examples/001-dotnet-WebClient/Program.cs

extensions/AzureAISearch/AzureAISearch/AzureAISearchFiltering.cs

dluc · 2024-04-26T02:20:33Z

I rebuilt the branch, fixing a merge gone wrong and making a few minor changes to namespaces and names. I see the approach taken introduces a new dependency with the responsibility of checking access and downloading files, which I'm not sure about in terms of design. I'll try playing with some changes, reorganizing these responsibilities and how memory/storage/orchestrator work together to provide the same functionality. My preference would be to leverage the orchestrator rather than nest content access into the validation service (which should just validate). I haven't looked at the download part yet, e.g. the IContent interface, which might actually be more important given that it affects the primary API.

service/Service.AspNetCore/WebAPIEndpoints.cs

extensions/MongoDbAtlas/MongoDbAtlas/MongoDbAtlasStorage.cs

service/Abstractions/ContentStorage/IContentStorage.cs

clients/dotnet/WebClient/MemoryWebClient.cs

clients/dotnet/WebClient/DocumentQuery.cs

dluc · 2024-04-30T19:29:13Z

service/Abstractions/Models/StreamableFileContent.cs

+
+namespace Microsoft.KernelMemory;
+
+public sealed class StreamableFileContent : IDisposable


wondering if we can reuse .NET FileInfo class and delete this one

## Motivation and Context (Why the change? What's the scenario?)

Supporting abstractions for new File Download feature. See also PR #415

## High level description (Approach, Design)

* New version 0.40
* Breaking changes on storage interface
* New methods on orchestration interface
* New methods on memory interface

service/tests/Core.FunctionalTests/ServerLess/SubDirFilesAndStreamsTest.cs

coryisakson requested a review from dluc as a code owner April 18, 2024 23:11

dluc added the waiting for author Waiting for author to reply or address comments label Apr 22, 2024

dluc requested changes Apr 24, 2024
