Semantic Memory service start [extraction] #1965
Conversation
```csharp
    break;

case MimeTypes.MsWord:
    text = new MsWordDecoder().DocToText(fileContent);
```
It might be nice if the decoders were injectable implementations of an interface that could be mapped by MimeType. I'm thinking about cases when different chunking strategies are needed for a certain use case.
Or maybe this is a more basic precursor to chunking that won't have many options on how it's implemented?
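A minimal sketch of what injectable decoders could look like (names like `IContentDecoder` and `DecoderRegistry` are hypothetical, not from this PR): each decoder declares the mime type it handles, the DI container injects all implementations, and a registry maps them by mime type so alternative decoding or chunking strategies can be swapped in per use case.

```csharp
// Hypothetical sketch, not the PR's actual API.
using System;
using System.Collections.Generic;
using System.Linq;

public interface IContentDecoder
{
    // Mime type this decoder handles, e.g. MimeTypes.MsWord.
    string MimeType { get; }

    string ExtractText(byte[] fileContent);
}

public class DecoderRegistry
{
    private readonly Dictionary<string, IContentDecoder> _byMimeType;

    // All registered IContentDecoder implementations are injected here
    // by the DI container (services.AddSingleton<IContentDecoder, ...>()).
    public DecoderRegistry(IEnumerable<IContentDecoder> decoders)
    {
        this._byMimeType = decoders.ToDictionary(d => d.MimeType);
    }

    public string ExtractText(string mimeType, byte[] fileContent)
        => this._byMimeType.TryGetValue(mimeType, out var decoder)
            ? decoder.ExtractText(fileContent)
            : throw new NotSupportedException($"No decoder registered for {mimeType}");
}
```

With this shape, the switch statement in the diff collapses to a single registry lookup, and adding a new file type means registering one more implementation rather than editing the dispatch code.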
agreed, this is an area where there will be work to do once the basic end to end flow is ready.
some notes:
- I introduced a mime detection interface, so it should be already possible to swap it out with a better one
- I'm thinking about allowing clients to pass a "type hint" so the code doesn't fail when the file extension is missing (this is frequent on Linux/MacOS)
- I've been planning to create a "type detection" handler that relies on the `file` command, to make this part more robust (see https://en.wikipedia.org/wiki/File_(command)). One option here would be using a docker image - I'm considering docker also for other scenarios, e.g. exporting text from some specific file types. Something for later, maybe in 2-3 sprints from now
```csharp
BlobUploadOptions options = new();
BlobLeaseClient? blobLeaseClient = null;
BlobLease? lease = null;
if (await blobClient.ExistsAsync(cancellationToken).ConfigureAwait(false))
```
nit: It might be faster to skip the call to ExistsAsync and instead just try to get the leaseClient and lease. If it doesn't exist it will throw and the leaseClient/lease will remain null.
interesting idea, that should reduce the number of requests, thanks. LeaseBlobAsync() also throws when the file is already leased, so the code will also have to check the status code (404 vs 409, probably). Keeping this open while I do some tests
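A sketch of the suggestion under discussion, using the Azure SDK (`GetBlobLeaseClient` comes from `Azure.Storage.Blobs.Specialized`; the 404/409 split is exactly the untested part the comment mentions):

```csharp
// Sketch only: acquire the lease directly instead of calling ExistsAsync,
// and classify the failure by HTTP status code.
BlobUploadOptions options = new();
BlobLeaseClient? blobLeaseClient = null;
BlobLease? lease = null;
try
{
    blobLeaseClient = blobClient.GetBlobLeaseClient();
    lease = await blobLeaseClient
        .AcquireAsync(TimeSpan.FromSeconds(60), cancellationToken: cancellationToken)
        .ConfigureAwait(false);
}
catch (RequestFailedException e) when (e.Status == 404)
{
    // Blob doesn't exist yet: proceed without a lease,
    // as the ExistsAsync == false branch did.
    blobLeaseClient = null;
}
catch (RequestFailedException e) when (e.Status == 409)
{
    // Blob exists but is already leased elsewhere: handle contention here.
    throw;
}
```

This trades one extra round trip (`ExistsAsync`) for exception-driven control flow; whether that is a net win depends on how often the blob is missing or contended.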
@@ -0,0 +1,77 @@
# Semantic Memory Service |
Comment for overall PR: Where are all the tests?
First part of the encoding pipeline, with support for synchronous and asynchronous processing. Pipelines can run locally, storing data on disk and in the cloud using Azure blobs. Asynchronous processing supports RabbitMQ and Azure Queues.