New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate options to make internal usage of eval redundant or optional #1019

Closed
josdejong opened this Issue Jan 17, 2018 · 29 comments

Comments

Projects
None yet
5 participants
@josdejong
Owner

josdejong commented Jan 17, 2018

The expression parser of mathjs has it's own expression parser, which first parses and expression into an AST, and then compiles it into high performing JavaScript code using eval. The usage of eval is great for performance, but it's also a security risk. In some restricted environments usage of eval is blocked for security reasons, which means that you can't use math.js everywhere.

I would love to do some experimenting in different directions (though I don't expect to have time for that anytime soon myself):

  • Maybe nowadays the JavaScript compilers are so good in optimizing code that we may not need eval anymore. For example, small functions will get inlined by the JS engine so we may not need to do that ourselves.
  • If eval remains critical in order to achieve good performance, it would be awesome if we could somehow make this optional: have two packages of mathjs, a slow (safer) one without usage of eval, and a fast one using eval. Not sure how we could do that without having to duplicate the logic in the compiler though.
  • Look into sandboxed environments where eval can be used safely. Though vm.runInContext turned out to be insecure, there are other initiatives which could be interesting to look into: frozen realms, asm.js, ADSafe.

There is a small benchmark file here, which can be used to see whether changes increase or decrease the performance.

Just to give you a little feeling what sort of code the expression generates right now:

math.parse('2+3').compile().eval.toString()
// function (scope) {    
//   if (scope) _validateScope(scope);    
//   scope = scope || {};    
//   return math["add"](2, 3);  
// }

math.parse('2x').compile().eval.toString()
// function (scope) {    
//   if (scope) _validateScope(scope);    
//   scope = scope || {};    
//   return math["multiply"](2, ("x" in scope ? getSafeProperty(scope, "x") : undef("x")));  
// }

math.parse('2 sin(x)').compile().eval.toString()
// function (scope) {
//   if (scope) _validateScope(scope);
//   scope = scope || {};    
//   return math["multiply"](
//     2, 
//     ("sin" in scope 
//       ? getSafeProperty(scope, "sin") 
//       : getSafeProperty(math, "sin"))(("x" in scope ? getSafeProperty(scope, "x") : undef("x")))); 
// }

math.parse('M[2, 1:4]').compile().eval.toString()
// function (scope) {
//   if (scope) _validateScope(scope);
//   scope = scope || {};
//   return access(
//     ("M" in scope ? getSafeProperty(scope, "M") : undef("M")), 
//     math.index(2, range(1, 4, 1))
//   );
// }

I should note that there is another place where eval is used internally: in typed-function, but let's keep that for another discussion.

@josdejong josdejong added the feature label Jan 17, 2018

@mikesamuel

This comment has been minimized.

mikesamuel commented Jan 18, 2018

Typo in

Though vm.runInContext turn out to be secure

?

@harrysarson

This comment has been minimized.

Collaborator

harrysarson commented Jan 18, 2018

Generation of WebAssembly code would be pretty neat and presuably result in very fast evaluation.

I think it would also guarantee that mathjs does not allow injection of malicious code when parsing expressions.

@mikesamuel

This comment has been minimized.

mikesamuel commented Jan 18, 2018

@harrysarson, Good point. wasm is designed to allow proper sandboxing via importObject during instantiation.

It would only work on Node >= 7 and probably not in browser for a while.
It looks like Node6 is going into maintenance in April so that may not be a problem for long.

It might also allow access to vector instructions via wasm loops.

@josdejong

This comment has been minimized.

Owner

josdejong commented Jan 18, 2018

Typo in

Though vm.runInContext turn out to be secure

?

oops, fixed that.

I think it would also guarantee that mathjs does not allow injection of malicious code when parsing expressions.

@harrysarson I'm not sure, does it protect against running arbitrary code when you've found access to new Function(...) (via a constructor)? I think so but I'm not familiar with asm.js.

@FSMaxB

This comment has been minimized.

Collaborator

FSMaxB commented Jan 18, 2018

Wouldn't WebAssembly require writing native code (like C, C++ or Rust)?

I don't think WebAssembly is a viable option for now because of limited compatibility, but if emscripten is used to create WebAssembly from native code, the same code can be used to create asm.js for the time being, so this might be a possible direction to go.

@FSMaxB

This comment has been minimized.

Collaborator

FSMaxB commented Jan 18, 2018

@harrysarson I'm not sure, does it protect against running arbitrary code when you've found access to new Function(...) (via a constructor)? I think so but I'm not familiar with asm.js.

@josdejong It seems like you are using WebAssembly and asm.js interchangeably. But they are something quite different.

asm.js is (from my understanding) just a subset of regular JavaScript with some type annotation so that Browser that support it can convert it into efficient machine code.

WebAssembly is (again from my understanding) an API and Instruction set that maps very closely to actual processor instruction sets, so they can be compiled to native instructions before the code is run.

@mikesamuel

This comment has been minimized.

mikesamuel commented Jan 18, 2018

@FSMaxB, WebAssembly would require compiling an AST to a stack-based binary instruction set.

My understanding is that asm.js was a proposal that Mozilla came put forward as what they would like instead of NaCl. That proposal showed that there was broad support for something like asm.js but without the JavaScript parser overhead. Most of the people who did spec writing for either NaCl or asm.js are now behind the WebAssembly spec and this is the first such proposal to get support from all major browser vendors.

@josdejong

This comment has been minimized.

Owner

josdejong commented Jan 18, 2018

@josdejong It seems like you are using WebAssembly and asm.js interchangeably. But they are something quite different.

🤣 ... well... that's illustrative for how much I know about it

@mikesamuel

This comment has been minimized.

mikesamuel commented Jan 18, 2018

I'm not sure, does it protect against running arbitrary code when you've found access to new Function(...) (via a constructor)? I think so but I'm not familiar with asm.js.

My understanding is that the validator disallows any call that doesn't provably produce a float and the return type of the Function constructor would be extern.

@josdejong

This comment has been minimized.

Owner

josdejong commented Jan 18, 2018

I just did a quick and dirty experiment removing compiling of an expression via:

compile(...) {
  return new Function (' string containing javascript code ')
}

to basically:

compile (...) {
  return function (scope) {
    // ... regular js
  }
}

First results look awesome:

expression: 2 + 3 * sin(pi / 4) - 4x

expression parse and evaluate            x 15,935 ops/sec ±3.26% (87 runs sampled)
expression parse and compile             x 17,090 ops/sec ±1.94% (88 runs sampled)
expression parse                         x 62,029 ops/sec ±2.71% (88 runs sampled)
evaluate                                 x 1,922,353 ops/sec ±0.39% (90 runs sampled)

(safe) expression parse and evaluate     x 21,944 ops/sec ±1.64% (91 runs sampled)
(safe) expression parse and compile      x 22,983 ops/sec ±1.87% (92 runs sampled)
(safe) expression parse                  x 65,019 ops/sec ±2.31% (86 runs sampled)
(safe) evaluate                          x 2,432,108 ops/sec ±0.58% (94 runs sampled)

The compiler without eval is actually faster than the current one! I hope to work this out in more detail soon, this looks reaaally promising

@mikesamuel

This comment has been minimized.

mikesamuel commented Jan 18, 2018

Wow!

Is ... regular js a tree interpreter?

@josdejong

This comment has been minimized.

Owner

josdejong commented Jan 19, 2018

It's basically something like this: the expression parser parses input into an expression tree or AST, which contains nodes like OperatorNode, FunctionNode, ConstantNode, SymbolNode, etc. For example a FunctionNode contains properties like fn (for example sqrt), and args (like a ConstantNode with the value 16).

Without compilation, you could evaluate such a node like this (simplified):

FunctionNode.prototype.eval = function (scope) {
  if (this.fn in scope) {
    return scope[this.fn].apply(null, this.args.map((arg) => arg.eval(scope));
  }
  else {
    return math[this.fn].apply(null, this.args.map((arg) => arg.eval(scope));
  }
}

This works, but for every evaluation a lot of work has to be done (if statements, iterating over arguments, calling compile on child nodes everywhere, ...)

To make this faster, mathjs first compiles into an optimized JavaScript function, something like this,
generating a string with JavaScript code (simplified):

FunctionNode.prototype.compile = function (math) {
  if (this.fn in scope) {
    return (this.fn) + '(' + this.args.map(arg => arg.compile(math)).join(', ') + ')';
  }
  else {
    return math[this.fn].apply(null, this.args.map(function (arg) {
      return arg.eval(scope);
    });
  }
}

// and which generates code which (simplified) looks like say:
//   '("sum" in scope ? scope["sum"] : math["sum"])(2, 3, 4)'

// which output is then wrapped and compiled into JS like:
{
  expr: new Function('scope', 'return ' + tree.compile(math))
} 

When I wrote this compiling step, it resulted in performance improvements in the order of a factor 10.

Now I'm experimenting with returning a function instead of returning a string which will be evaluated as new JavaScript:

FunctionNode.prototype.compile = function (math) {
  var fn = this.fn // for example 'sum'
  var args = this.args.map(arg => arg.compile(math))
  return function (scope) {
    return (fn in scope ? scope[fn] : math[fn]).apply(null, args.map(arg => arg(scope)));
  }
}
// note that we can easily write an optimized version for functions 
// with 1 or 2 arguments for example.

Apparently this new solution has a similar performance (or even slightly better) than the eval version. That looks counter intuitive, because an evaluation involves again more function calls and looping over arrays, which are relatively expensive operations. So... it looks like today's JavaScript engines are just extremely good in optimizing a cascading chain of these little nested functions.

@FSMaxB

This comment has been minimized.

Collaborator

FSMaxB commented Jan 19, 2018

Why not go back to the original approach with prototype.eval? Shouldn't that have the same performance as or even better than the approach with returning a function?

@FSMaxB

This comment has been minimized.

Collaborator

FSMaxB commented Jan 19, 2018

compile could then be a noop that just returns a function that calls node.eval.

@josdejong

This comment has been minimized.

Owner

josdejong commented Jan 19, 2018

Why not go back to the original approach with prototype.eval? Shouldn't that have the same performance as or even better than the approach with returning a function?

The example I posted is quite simplified, some nodes have a lot of logic which you can execute once in a compile step, as it doesn't yield changes when executing with a different scope. See for example AssignmentNode which has quite some logic. If you do as much work as beforehand in a compile step, and only keep the stuff that depends on the values in the scope dynamic. That make evaluation much faster.

@josdejong

This comment has been minimized.

Owner

josdejong commented Jan 21, 2018

I've completely rewritten the compile function to one which doesn't use eval. Currently the old compile method is renamed to node.compileWithEval(), and the new implementation is named node.compile(), so we can compare them side by side. I did some small tests on all different browers, and it looks like the new solution (without eval) is slightly faster than the old one. Firefox is an exception, there the new solution is almost twice as slow, but still faster than Chrome.

This is a breaking change, since the _compile functions have a different signature and changes how you have to write custom nodes.

Branch: https://github.com/josdejong/mathjs/tree/compile-without-eval
Commit: 1192bb6
Benchmark: https://github.com/josdejong/mathjs/blob/compile-without-eval/benchmark/expression_parser.js

Note that this doesn't make security a non-issue for the parser at all, but it makes some categories of exploits impossible. It also makes the code easier to read/understand, makes debugging easier (good stacktrackes).

@mikesamuel

This comment has been minimized.

mikesamuel commented Jan 21, 2018

Note that this doesn't make security a non-issue for the parser at all, but it makes some categories of exploits impossible.

Is the code below where identifiers in untrusted inputs get resolved to lvalues and rvalues?

return function (scope, args, context) {
return name in scope
? getSafeProperty(scope, name)
: isUnit
? new type.Unit(null, name)
: undef(name);

If so, there might be a few tweaks to SymbolNode that might prevent custom node implementations from naively trusting the name.

@josdejong

This comment has been minimized.

Owner

josdejong commented Jan 22, 2018

Yes, that's one of them. Basically, many exploits boil down to getting access to Function which allows executing arbitrary JavaScript, like via sqrt.constructor. Therefore, we put guards on all places where we get or set object properties: from scope, from math, or from object accessors. That are these functions getSafeProperty, setSafeProperty, validateSafeMethod that you can find in classes like SymbolNode, AccessorNode, AssignmentNode. I would love to have these guards functions less spread all over the place as it is right now, this makes it hard to overlook. @ThomasBrierley is looking into giving these guard-methods a first refactoring (see #997 (comment)), I'm curious what that will bring.

If so, there might be a few tweaks to SymbolNode that might prevent custom node implementations from naively trusting the name.

What tweaks do you mean exactly?

@mikesamuel

This comment has been minimized.

mikesamuel commented Jan 22, 2018

Basically, many exploits boil down to getting access to Function

This sounds like RCE:

// Let x be any value not in
// (null, undefined, Object.create(null)).
var x = {},
// If the attacker can control three strings
    a = 'constructor',
    b = 'constructor',
    s = 'console.log(s)';
// and trick code into doing two property lookups
// they control, a call with a string they control,
// and one more call with any argument
x[a][b](s)();
// then they can cause any side-effect achievable
// solely via objects reachable from the global scope.
// This includes full access to any exported module APIs,
// all declarations in the current module, and access
// to builtin modules like child_process, fs, and net.

What tweaks do you mean exactly?

Well, if the problem is untrusted identifiers reaching the square bracket operator, then
you could wrap all untrusted identifiers in a class like that below which would require
someone to explicitly think about what they're doing with the identifier but still make
it obvious what's going on in log messages.

class UntrustedIdentifier {
  constructor (identifier) {
    Object.defineProperty(
      this, unsafeId,
      { configurable: false, writable: false, value: String(identifier) })
  }

  toString () {
    // Coercion to string returns something that aids debugging.
    // Searches for UnsafeIndentifier should direct devs to this file. 
    return '(UnsafeIdentifier ' + JSON.stringify(this.id) + ')'
  }

  /** Convenience to read an own property but not anything from a prototype. */
  readFrom (obj) {
    return Object.hasOwnProperty.call(obj, this.id) ? obj[this.id] : undefined
  }
}
@josdejong

This comment has been minimized.

Owner

josdejong commented Jan 22, 2018

Thanks Mike, I like that idea! That will prevent accidentally using an unsafe property or method.

@josdejong

This comment has been minimized.

Owner

josdejong commented Jan 22, 2018

@ThomasBrierley want do you think of Mike's idea?

@ThomasBrierley

This comment has been minimized.

Contributor

ThomasBrierley commented Feb 5, 2018

I think I might be showing my ignorance of mathjs here so keep that in mind for my bellow comments.

My understanding of the methods exported by custom.js was that they are called when accessing properties but I haven't really dug into the context where that happens, I assumed it was automatically applied to every property.

if the problem is untrusted identifiers reaching the square bracket operator, then
you could wrap all untrusted identifiers in a class like that below which would require
someone to explicitly think about what they're doing with the identifier

Would every operator implementer have to make this consideration? or would it be applied in one place? sorry maybe you can enlighten me here... basically my thoughts are if this requires multiple implementors to be thoughtful of security in multiple places it potentially opens up lots of holes from the inevitable little mistakes. On the other hand if it's only applied in one place then extra care only has to be taken in one place.

[EDIT]

Reading backwards up this thread :P so currently the property guards are called via various nodes classes and doesn't have to be considered by implementer of each operator? (Which I think is a good thing from a security point of view at least). Would it be possible to use that identifier wrapper in a centralised manner also?

I think there might be some differences here though i'm not very familiar with the compiler: currently we only check for properties of objects, but an identifier could just be some variable, unless they are just checked as a property of scope?

Sorry lots of questions, i'm trying to catch up with this thread.

@josdejong

This comment has been minimized.

Owner

josdejong commented Feb 5, 2018

@ThomasBrierley currently these security functions like isSafeMethod, getSafeProperty, etc are called in quite some places in the code. In the current solution it is easy to somewhere forget to retrieve a property the safe way. The idea of Mike in #1019 (comment) simply makes it impossible to get an unsafe property unless you're using functions like getSafeProperty on the UntrustedIdentifier. The places where such an UntrustedIdentifier will be created are very limited.

@ThomasBrierley

This comment has been minimized.

Contributor

ThomasBrierley commented Feb 5, 2018

Ahh, well that certainly does sound better.

The natural interface for my new implementation is access(obj, key, mode). Where mode is one of ('set', 'get', 'call'), these modes could be mapped to methods of the UntrustedIdentifier class if preferred.

Also "obj" is currently mandatory... would that pose a problem for some identifiers when referencing a global variable of some kind?

@josdejong

This comment has been minimized.

Owner

josdejong commented Feb 6, 2018

Sounds good. Maybe a single method with three values for mode is a bit abstract? How about an UntrustedIdentifier class with three explicit methods on it for the different types? I suppose in case of 'call' you will also have to pass arguments, and in case of set you will have to pass the new value, so these three methods will have a different signature too.

Also "obj" is currently mandatory... would that pose a problem for some identifiers when referencing a global variable of some kind?

I don't expect so: "global" variables are defined in the math namespace, which is also just an object.

@ThomasBrierley

This comment has been minimized.

Contributor

ThomasBrierley commented Feb 6, 2018

That sounds fine. I can see that from the perspective of actually getting, setting and calling they do have quite different signatures... However from the perspective of purely checking if it's safe it is useful to have a single function with a mode for the sake of keeping the rules organised.

I'll carry on with the refactor as planned but expose a new access type function from custom.js then migrating away from the guard functions to use of an UntrustedIdentifier class can be done separately (Utilising the new access function from the class's get, set and call methods which should be fairly lightweight).

@josdejong

This comment has been minimized.

Owner

josdejong commented Feb 6, 2018

Cool, I'm curious how this approach will work out!

@josdejong

This comment has been minimized.

Owner

josdejong commented Feb 21, 2018

I will close this issue now: mathjs v4 is completely eval-free 😃 I hope to release the final version within a week.

Until a month ago I hadn't dare to dream about getting this far. @mikesamuel thanks for triggering this whole discussion. I'm afraid you will have to rewrite the comments on the usage of eval in mathjs in your Security Roadmap to past perfect 😬

See #682 (comment) and the HISTORY.md file of v4 for details.

@josdejong josdejong closed this Feb 21, 2018

@mikesamuel

This comment has been minimized.

mikesamuel commented Feb 21, 2018

@josdejong Awesome. Updated the roadmap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment