Semantic Memory service start [extraction] #1965

Merged: dluc merged 1 commit into microsoft:main from dluc165memoryservice on Jul 12, 2023

Conversation

dluc (Collaborator) commented on Jul 11, 2023:

First part of the encoding pipeline, with support for synchronous and asynchronous processing. Pipelines can run locally, storing data on disk and in the cloud using Azure blobs. Asynchronous processing supports RabbitMQ and Azure Queues.
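[Editor's note: to make the shape of this concrete, here is a minimal sketch of the kind of abstractions such a pipeline implies. All names and signatures below are hypothetical illustrations, not the actual API introduced by this PR.]

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch only: these names and signatures are illustrative,
// not the actual API introduced by this PR.
public interface IContentStorage
{
    // Persist an uploaded file; implementations could target the local
    // disk or Azure Blob storage.
    Task WriteFileAsync(string pipelineId, string fileName, Stream content,
        CancellationToken cancellationToken = default);
}

public interface IQueue
{
    // Hand a pipeline step off to a background worker; implementations
    // could be backed by RabbitMQ or Azure Queues.
    Task EnqueueAsync(string message, CancellationToken cancellationToken = default);
}
```

In the synchronous case the handlers would run in-process one after another; in the asynchronous case each step would be dispatched through the queue abstraction to workers.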

@shawncal added the "docs and tests" label (Improvements or additions to documentation) on Jul 11, 2023
@dluc requested a review from alliscode on July 11, 2023 23:56
@dluc force-pushed the dluc165memoryservice branch 4 times, most recently from 6e1c134 to 30ccac8, on July 12, 2023 00:18
@dluc added the "PR: ready for review" and "memory connector" labels on Jul 12, 2023
@dluc changed the title from "Semantic Memory start [extraction]" to "Semantic Memory service start [extraction]" on Jul 12, 2023
@dluc force-pushed the dluc165memoryservice branch 2 times, most recently from 032797c to 022216a, on July 12, 2023 17:16
Review thread on services/semantic-memory/README.md (resolved)
```csharp
    break;

case MimeTypes.MsWord:
    text = new MsWordDecoder().DocToText(fileContent);
```
A Member commented:
It might be nice if the decoders were injectable implementations of an interface that could be mapped by MimeType. I'm thinking about cases when different chunking strategies are needed for a certain use case.

Or maybe this is a more basic precursor to chunking that won't have many options on how it's implemented?
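[Editor's note: a minimal sketch of what such an injectable, MIME-type-keyed decoder registry could look like. IContentDecoder and DecoderRegistry are hypothetical names, not types from this PR.]

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of the suggestion above; none of these types
// exist in the PR.
public interface IContentDecoder
{
    string ExtractText(Stream fileContent);
}

public sealed class DecoderRegistry
{
    private readonly Dictionary<string, IContentDecoder> _decoders = new();

    // Decoders are registered per MIME type, e.g. from the DI container,
    // so a host can swap in a decoder with a different chunking strategy.
    public void Register(string mimeType, IContentDecoder decoder)
        => this._decoders[mimeType] = decoder;

    public string ExtractText(string mimeType, Stream fileContent)
        => this._decoders.TryGetValue(mimeType, out IContentDecoder? decoder)
            ? decoder.ExtractText(fileContent)
            : throw new KeyNotFoundException($"No decoder registered for '{mimeType}'");
}
```

A host could then register, say, a Word decoder under MimeTypes.MsWord at startup and replace it per deployment, which would also replace the switch statement shown above.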

dluc (Collaborator, Author) replied:

Agreed, this is an area where there will be work to do once the basic end-to-end flow is ready.
Some notes:

  • I introduced a mime detection interface, so it should already be possible to swap it out with a better one
  • I'm thinking about allowing clients to pass a "type hint" so the code doesn't fail when the file extension is missing (this is common on Linux/macOS); see the sketch after this list
  • I've been planning to create a "type detection" handler that relies on the file command, to make this part more robust (see https://en.wikipedia.org/wiki/File_(command)). One option here would be using a Docker image; I'm considering Docker also for other scenarios, e.g. exporting text from some specific file types. Something for later, maybe 2-3 sprints from now
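[Editor's note: as a rough illustration of the first two notes, a detection interface with an optional type hint might look like this. The shape is hypothetical, not the interface actually added in this PR.]

```csharp
// Hypothetical shape of a MIME detection interface with a client-supplied
// type hint; not the interface actually introduced by this PR.
public interface IMimeTypeDetection
{
    // The optional hint lets clients state the type explicitly when the
    // file name has no extension (common on Linux/macOS).
    string GetFileType(string fileName, string? mimeTypeHint = null);
}

public sealed class FileExtensionMimeDetection : IMimeTypeDetection
{
    public string GetFileType(string fileName, string? mimeTypeHint = null)
    {
        // Trust the client's hint first, falling back to the extension.
        if (!string.IsNullOrEmpty(mimeTypeHint)) { return mimeTypeHint; }

        return System.IO.Path.GetExtension(fileName).ToLowerInvariant() switch
        {
            ".txt" => "text/plain",
            ".pdf" => "application/pdf",
            ".docx" => "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            _ => "application/octet-stream",
        };
    }
}
```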

```csharp
BlobUploadOptions options = new();
BlobLeaseClient? blobLeaseClient = null;
BlobLease? lease = null;
if (await blobClient.ExistsAsync(cancellationToken).ConfigureAwait(false))
```
A Member commented:
nit: It might be faster to skip the call to ExistsAsync and instead just try to get the leaseClient and lease. If the blob doesn't exist, the call will throw and the leaseClient/lease will remain null.

dluc (Collaborator, Author) replied:
Interesting idea, thanks; that should reduce the number of requests. LeaseBlobAsync() also throws when the file is already leased, so the code will have to check the status code (probably 404 vs. 409). Keeping this open while I run some tests.
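[Editor's note: for reference, a sketch of the try-first approach under discussion, using the Azure.Storage.Blobs lease API. This is illustrative; the final code may differ.]

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;
using Azure.Storage.Blobs.Specialized;

internal static class BlobLeaseHelper
{
    // Try to acquire the lease directly instead of calling ExistsAsync
    // first, saving one round trip. Returns null when the blob doesn't
    // exist (HTTP 404); a 409 (already leased by another client) is left
    // to propagate so the caller can handle contention.
    public static async Task<BlobLease?> TryAcquireLeaseAsync(
        BlobClient blobClient, CancellationToken cancellationToken)
    {
        try
        {
            BlobLeaseClient leaseClient = blobClient.GetBlobLeaseClient();
            Response<BlobLease> lease = await leaseClient
                .AcquireAsync(TimeSpan.FromSeconds(30), cancellationToken: cancellationToken)
                .ConfigureAwait(false);
            return lease.Value;
        }
        catch (RequestFailedException e) when (e.Status == 404)
        {
            // Blob not found: nothing to lease yet, proceed with the first upload.
            return null;
        }
    }
}
```

With this shape the ExistsAsync round trip disappears, and the 404 vs. 409 distinction mentioned above is handled by the exception filter.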

```diff
@@ -0,0 +1,77 @@
# Semantic Memory Service
```
A Member commented:
Comment for overall PR: Where are all the tests?

@dluc merged commit 4d4c415 into microsoft:main on Jul 12, 2023 (8 checks passed)
@dluc deleted the dluc165memoryservice branch on July 12, 2023 18:03
piotrek-appstream pushed a commit to Appstream-Studio/semantic-kernel that referenced this pull request on Jul 19, 2023
johnoliver pushed two commits to johnoliver/semantic-kernel that referenced this pull request on Jun 5, 2024