Semantic Memory service start [extraction] #1965

Merged: dluc merged 1 commit into microsoft:main from dluc165memoryservice on Jul 12, 2023

Conversation

dluc (Collaborator) commented on Jul 11, 2023:

First part of the encoding pipeline, with support for synchronous and asynchronous processing. Pipelines can run locally, storing data on disk and in the cloud using Azure blobs. Asynchronous processing supports RabbitMQ and Azure Queues.
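[Editor's note: to make the shape of this concrete, here is a minimal sketch of the kind of abstractions such a pipeline implies. All names and signatures below are hypothetical illustrations, not the actual API introduced by this PR.]

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch only: these names and signatures are illustrative,
// not the actual API introduced by this PR.
public interface IContentStorage
{
    // Persist an uploaded file; implementations could target the local
    // disk or Azure Blob storage.
    Task WriteFileAsync(string pipelineId, string fileName, Stream content,
        CancellationToken cancellationToken = default);
}

public interface IQueue
{
    // Hand a pipeline step off to a background worker; implementations
    // could be backed by RabbitMQ or Azure Queues.
    Task EnqueueAsync(string message, CancellationToken cancellationToken = default);
}
```

In the synchronous case the handlers would run in-process one after another; in the asynchronous case each step would be dispatched through the queue abstraction to workers.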

@shawncal added the "docs and tests" label (Improvements or additions to documentation) on Jul 11, 2023
@dluc requested a review from alliscode on July 11, 2023 23:56
@dluc force-pushed the dluc165memoryservice branch 4 times, most recently from 6e1c134 to 30ccac8, on July 12, 2023 00:18
@dluc added the "PR: ready for review" and "memory connector" labels on Jul 12, 2023
@dluc changed the title from "Semantic Memory start [extraction]" to "Semantic Memory service start [extraction]" on Jul 12, 2023
@dluc force-pushed the dluc165memoryservice branch 2 times, most recently from 032797c to 022216a, on July 12, 2023 17:16
Review thread on services/semantic-memory/README.md (resolved)
```csharp
    break;

case MimeTypes.MsWord:
    text = new MsWordDecoder().DocToText(fileContent);
```
A Member commented:
It might be nice if the decoders were injectable implementations of an interface that could be mapped by MimeType. I'm thinking about cases when different chunking strategies are needed for a certain use case.

Or maybe this is a more basic precursor to chunking that won't have many options on how it's implemented?
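[Editor's note: a minimal sketch of what such an injectable, MIME-type-keyed decoder registry could look like. IContentDecoder and DecoderRegistry are hypothetical names, not types from this PR.]

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of the suggestion above; none of these types
// exist in the PR.
public interface IContentDecoder
{
    string ExtractText(Stream fileContent);
}

public sealed class DecoderRegistry
{
    private readonly Dictionary<string, IContentDecoder> _decoders = new();

    // Decoders are registered per MIME type, e.g. from the DI container,
    // so a host can swap in a decoder with a different chunking strategy.
    public void Register(string mimeType, IContentDecoder decoder)
        => this._decoders[mimeType] = decoder;

    public string ExtractText(string mimeType, Stream fileContent)
        => this._decoders.TryGetValue(mimeType, out IContentDecoder? decoder)
            ? decoder.ExtractText(fileContent)
            : throw new KeyNotFoundException($"No decoder registered for '{mimeType}'");
}
```

A host could then register, say, a Word decoder under MimeTypes.MsWord at startup and replace it per deployment, which would also replace the switch statement shown above.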

dluc (Collaborator, Author) replied:

Agreed, this is an area where there will be work to do once the basic end-to-end flow is ready.
Some notes:

  • I introduced a mime detection interface, so it should already be possible to swap it out with a better one
  • I'm thinking about allowing clients to pass a "type hint" so the code doesn't fail when the file extension is missing (this is common on Linux/macOS); see the sketch after this list
  • I've been planning to create a "type detection" handler that relies on the file command, to make this part more robust (see https://en.wikipedia.org/wiki/File_(command)). One option here would be using a Docker image; I'm considering Docker also for other scenarios, e.g. exporting text from some specific file types. Something for later, maybe 2-3 sprints from now
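[Editor's note: as a rough illustration of the first two notes, a detection interface with an optional type hint might look like this. The shape is hypothetical, not the interface actually added in this PR.]

```csharp
// Hypothetical shape of a MIME detection interface with a client-supplied
// type hint; not the interface actually introduced by this PR.
public interface IMimeTypeDetection
{
    // The optional hint lets clients state the type explicitly when the
    // file name has no extension (common on Linux/macOS).
    string GetFileType(string fileName, string? mimeTypeHint = null);
}

public sealed class FileExtensionMimeDetection : IMimeTypeDetection
{
    public string GetFileType(string fileName, string? mimeTypeHint = null)
    {
        // Trust the client's hint first, falling back to the extension.
        if (!string.IsNullOrEmpty(mimeTypeHint)) { return mimeTypeHint; }

        return System.IO.Path.GetExtension(fileName).ToLowerInvariant() switch
        {
            ".txt" => "text/plain",
            ".pdf" => "application/pdf",
            ".docx" => "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            _ => "application/octet-stream",
        };
    }
}
```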

```csharp
BlobUploadOptions options = new();
BlobLeaseClient? blobLeaseClient = null;
BlobLease? lease = null;
if (await blobClient.ExistsAsync(cancellationToken).ConfigureAwait(false))
```
A Member commented:
nit: It might be faster to skip the call to ExistsAsync and instead just try to get the leaseClient and lease. If the blob doesn't exist, the call will throw and the leaseClient/lease will remain null.

dluc (Collaborator, Author) replied:
Interesting idea, thanks; that should reduce the number of requests. LeaseBlobAsync() also throws when the file is already leased, so the code will have to check the status code (probably 404 vs. 409). Keeping this open while I run some tests.
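[Editor's note: for reference, a sketch of the try-first approach under discussion, using the Azure.Storage.Blobs lease API. This is illustrative; the final code may differ.]

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;
using Azure.Storage.Blobs.Specialized;

internal static class BlobLeaseHelper
{
    // Try to acquire the lease directly instead of calling ExistsAsync
    // first, saving one round trip. Returns null when the blob doesn't
    // exist (HTTP 404); a 409 (already leased by another client) is left
    // to propagate so the caller can handle contention.
    public static async Task<BlobLease?> TryAcquireLeaseAsync(
        BlobClient blobClient, CancellationToken cancellationToken)
    {
        try
        {
            BlobLeaseClient leaseClient = blobClient.GetBlobLeaseClient();
            Response<BlobLease> lease = await leaseClient
                .AcquireAsync(TimeSpan.FromSeconds(30), cancellationToken: cancellationToken)
                .ConfigureAwait(false);
            return lease.Value;
        }
        catch (RequestFailedException e) when (e.Status == 404)
        {
            // Blob not found: nothing to lease yet, proceed with the first upload.
            return null;
        }
    }
}
```

With this shape the ExistsAsync round trip disappears, and the 404 vs. 409 distinction mentioned above is handled by the exception filter.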

```diff
@@ -0,0 +1,77 @@
# Semantic Memory Service
```
A Member commented:
Comment for overall PR: Where are all the tests?

@dluc merged commit 4d4c415 into microsoft:main on Jul 12, 2023 (8 checks passed)
@dluc deleted the dluc165memoryservice branch on July 12, 2023 18:03
piotrek-appstream pushed a commit to Appstream-Studio/semantic-kernel that referenced this pull request on Jul 19, 2023
johnoliver pushed two commits to johnoliver/semantic-kernel that referenced this pull request on Jun 5, 2024