Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create worker code from module instead of from string #563

Closed
jeshua-clipchamp opened this issue Aug 31, 2022 · 16 comments
Closed

Create worker code from module instead of from string #563

jeshua-clipchamp opened this issue Aug 31, 2022 · 16 comments
Assignees

Comments

@jeshua-clipchamp
Copy link

What's the problem?

In Timeout.ts a worker is loaded from a hard-coded string within the file. This doesn't play nicely with the Content-Security-Policy, where if you allow creating workers from strings you have to let anyone create a worker from a string (as far as I could tell there is no way to limit it to specific libraries, worker code with specific hashes, from files with specific hashes etc.). This is a pretty big security hole for something which is relatively unimportant (my understanding of this worker is to provide accurate timing if the user changes tabs, is that correct?).

We are currently working around this by patching the source code on checkout to just use the standard setTimeout:

diff --git a/distrib/es2015/src/common.speech/ServiceRecognizerBase.js b/distrib/es2015/src/common.speech/ServiceRecognizerBase.js
index 2fba3f3f26e134e720c78e9a692ff84c4e0d3898..1865fcc0aaf20768d85877b3dae50901b3a394ec 100644
--- a/distrib/es2015/src/common.speech/ServiceRecognizerBase.js
+++ b/distrib/es2015/src/common.speech/ServiceRecognizerBase.js
@@ -557,7 +557,7 @@ export class ServiceRecognizerBase {
     }
     delay(delayMs) {
         return new Promise((resolve, reject) => {
-            this.privSetTimeout(resolve, delayMs);
+            setTimeout(resolve, delayMs);
         });
     }
     writeBufferToConsole(buffer) {

Possible solutions

  • Move the worker code into a file which can be compiled/bundled along with everything else (then worker-src 'self' should work)
  • Add an option to not use the worker (so we don't have to patch things)
@glharper
Copy link
Member

glharper commented Sep 5, 2022

@jeshua-clipchamp, Thank you for using Speech SDK, and writing up this issue. This admittedly obtuse method of setting timeouts is a workaround for allowing continued recognition when the browser is minimized. (See #74 for more information.)

Regressing this behavior is not tenable, obviously, and I can't make any promises yet, but adding an option not to use the worker seems doable, as long as the browser-minimized regression is not something you care about when that option is set. I'll update this issue when I get a chance to investigate further.

@glharper glharper self-assigned this Sep 5, 2022
@glharper
Copy link
Member

glharper commented Sep 8, 2022

@jeshua-clipchamp Having investigated this issue, I'm curious why the worker in PCMRecorder (created from a string) is acceptable, but the timeout worker is not.

I'm also curious about the issue you're describing with Content-Security-Policy, as you can specify the allowed URL of any workers using the worker-src tag, described here.

Would you mind elaborating a bit on why worker-src doesn't work for your use case?

@glharper glharper added question Further information is requested pending close Ready for closure pending follow-up or prolonged inactivity labels Sep 8, 2022
@glharper
Copy link
Member

Closing this on lack of response to follow-up questions, may re-open on response.

@jeshua-clipchamp
Copy link
Author

Whoops, sorry about the lack of response Github decided to not notify me about your replies 😅

For your first question, I assume that PCMRecorder is OK because we don't use the record functionality (we are pulling out audio from an existing video file and sending that directly rather than captioning as it is spoken). As long as it never gets executed, I guess CSP is totally find 😄

For your second point, worker-src lets you specify a domain where worker code can be loaded from, but because these files are loaded from a string it can't figure out that the code for the string came from the same source file (I guess they account for the case where you manually fetch the code and then create a worker via string rather than creating it from a URL). From my testing the only way to get it loading is to allow any blob through (by specifying worker-src blob:), but there is no way to specify exactly where the blob is allowed to come from.

Having the worker code as a separate file and loading it from that should be enough to fix this (because then we could do worker-src self and it should work), not sure how to do that with your package/deployment though 🙂

@glharper
Copy link
Member

@jeshua-clipchamp When looking at this before, using a data URL seemed to be a way forward that works with our current build processes. See this branch's Timeout.ts for an example of loading a base64 string as a worker. Let me know your thoughts.

@glharper glharper removed the pending close Ready for closure pending follow-up or prolonged inactivity label Sep 29, 2022
@jeshua-clipchamp
Copy link
Author

I suspect that would have the same kind of problem (it's still loading a worker from a string after all). Another option might be to allow users to supply the worker themselves, then we could do some shenanigans to build the worker from a file or something?

@glharper
Copy link
Member

glharper commented Oct 4, 2022

I suspect that would have the same kind of problem (it's still loading a worker from a string after all).

@jeshua-clipchamp From Using Web Workers:
"To specify a content security policy for the worker, set a Content-Security-Policy response header for the request which delivered the worker script itself.

The exception to this is if the worker script's origin is a globally unique identifier (for example, if its URL has a scheme of data or blob). In this case, the worker does inherit the CSP of the document or worker that created it."

@glharper glharper removed the question Further information is requested label Oct 5, 2022
@jeshua-clipchamp
Copy link
Author

Right, but this means you have to include blob: as a scheme for loading workers but there is no way to restrict where this blob comes from (or what the sha of its contents is), meaning that if we allow workers to be made from blobs everything would be able to make workers from blobs (rather than what we have now which is workers can only be created from self).

@glharper
Copy link
Member

glharper commented Oct 6, 2022

@jeshua-clipchamp In the upcoming release, the code will be included as a data url, as shown in the merged PR above.

@jeshua-clipchamp
Copy link
Author

Hmmm I'll give that a go when it comes out, thanks for the fix :) I'm interested as to whether you can only allow data URLs from specific domains with CSP but I guess we'll see :D

@glharper
Copy link
Member

@jeshua-clipchamp JS Speech SDK v1.24 has been released, with this fix included. Thanks again for submitting this issue!

@ishowta
Copy link

ishowta commented Feb 13, 2023

I'm not sure about security, but isn't there any difference between blobs and data other than the encoding method?

@me4502
Copy link
Member

me4502 commented Jun 14, 2023

From what I can tell the provided fix might not resolve the original concern, as the data: scheme in CSP does not allow any further specificity (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy/Sources#sources). This means by enabling it we'd be allowing any data URLs to be used as the source for a worker, which is still a fairly large security nightmare.

Would it be possible to please get an option to disable this worker in high-security environments, with the caveat that accurate timing information will be lost when the tab is minimised? For our specific use-case we're already patching this out and just using a standard browser setTimeout so it would not be a regression for us, and I imagine there are others who would take this trade-off over the added security implications

@glharper
Copy link
Member

glharper commented Jun 14, 2023

@me4502 Thanks for looking at and thinking about this issue. Can the "hash-algorithm"-"base64-value" attribute not being applied to this worker?

@me4502
Copy link
Member

me4502 commented Jun 19, 2023

After testing that, it does appear to work. Thanks for pointing that one out, I was just looking at the data: scheme.

My only concern with this is that when the library updates it's an extra step to ensure that the base64'd hash is updated alongside it, but that's not too major an issue.

@me4502
Copy link
Member

me4502 commented Jun 19, 2023

Actually, we've found that while the hash I setup works on macOS (the OS I initially tested on), the value appears to differ on Windows. Unsure if this is something encoding related, but given it appears to differ here there's a possibility it'll differ further based on other factors. Due to this, this doesn't seem to be a viable fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants