New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bytelist love merge #5151

Merged
merged 91 commits into from Apr 25, 2018

Conversation

Projects
None yet
2 participants
@enebo
Member

enebo commented Apr 24, 2018

So the branch is a bit misnamed at this point but I will summarize this a tiny bit (have talked quite a lot to @headius about this already).

all identifiers are made as RubySymbols with exception of annotations (currently assumed as 7bit ASCII -- this can be changed later). RubySymbols have bytelist + 8859_1 identifier string. Parser and IR now use Symbol. rest of runtime continues to use the j.l.String as identifier/intern string. New vocabulary:

  • RubySymbol - name of the identifier
  • String - id of the identifier

RubySymbol.idString() is how you get an id for what you are looking up.

Error reporting has new static str(), names(), and ids() methods for making properly encoded Error strings. This is not 100% completed but many many errors have been converted.

enebo added some commits Dec 12, 2017

Amazingly almost everything still passes (spec:ruby:fast totally does).
This start changing all defined methods to use the raw (8859_1) string as the
key in the method table.  Symbol StringLiteral both now properly use proper
ByteList and does not muck with String.
Simple regression in recent changes. Some trace instr will have a nul…
…l name.

Guard against it since it is a normal value.
Trying this. Seemingly now all symbol(String) in RubySymbol will be 8…
…859_1 or

a raw string.  So I made toString() decode to actual charset based on encoding.
This may need another field if we discover toString() on RubySymbol is called
all the time but let's roll with this made on-demand for now.
Make common place for common bytelists.
Make and/or use isOr/isAnd and leverage AST being able to tell us this vs
  manually figuring it out.
argument parameters use bytelist now (mostly). This allows stuff like:
```text
jruby -e 'p method(def foo(あ); end).parameters'
[[:req, :あ]]
```

to work.  Up til now this would confuse our impl and we may see an error like:

```text
ArgumentError: invalid byte sequence in US-ASCII
  inspect at org/jruby/RubySymbol.java:283
  inspect at org/jruby/RubySymbol.java:268
  inspect at org/jruby/RubyArray.java:1659
  inspect at org/jruby/RubyArray.java:1659
        p at org/jruby/RubyKernel.java:492
   <main> at -e:1
```

Pure ruby code should not properly be giving properly encoded values back. Java
(e.g. native) methods mostly should not matter since they do not return names
but if they do return names then they are all assumed to be UTF-8 encoded data.
Unbreak spec:compiler from merge with master. It was calling the raw …
…field

name but name changed from a String to a ByteList which make new frame read
write code find nothing.
Use raw strings so JIT at least finds the method. This uncovers a wri…
…nkle in

our strategy.  Errors are generated from the search deep in Module but it is a
raw String.  This is somewhat ok for most strings because they just happen to be
utf-8 so the display.  This is a big issue though since we want proper error
message....
Made method storage backed by ByteList instead of String. I knew this…
… day

would come but I was hoping I could cheat with 8859_1 strings enough to merge
for 9.2.  Unfortunately, lack of encoding through method methods means we cannot
accurately throw properly encoded error messages since he have long lost the
encoding of the method.

So with that said, most people will not be able to notice this change.  All the
String methods for methods still exist.  These methods will assume all data is
UTF-8.  This could be a dicey gamble but since it is the default for Ruby it
likely will not impact anyone.  Also since UTF-8 is 7bit clean for ASCII we
should totally not have an issue there.

Some API signatures have changed.  This means 9.2 might end up causing some
issues if you are doing something very low-level with our code base (although
changes up to this point have already broken some APIs within parsing and IR
and that is unavoidable).  The method reader/writer maps needed their return
type changed from <String, DynamicMethod> to <ByteList, DynamicMethod>.  It
was an option to make a new name for methods which returned this but this is
so far down in our APIs I think we should run with it.  Also ProfiledMethod
has changed from String to ByteList (e.g. getName() -> ByteList from String).
Ultimately ProfiledMethod needs to be properly encoded to be able to report
the method in a properly encoded way...so there you have it.

Next phase is to change interpreters and the JIT to stop from going:
ByteList -> String -> ByteList.  I did not do that part here since this is
pretty large already.
throw newTypeError(parent.getName() + "::" + moduleObj.getMetaClass().getName() + " is not a module");
}
RubyString typeName = parentIsObject ?
types(this, moduleObj.getMetaClass()) : types(this, parent, moduleObj.getMetaClass());

This comment has been minimized.

@kares

kares Apr 25, 2018

Member

maybe a throwis lost here - unless its intentional

This comment has been minimized.

@enebo

enebo Apr 25, 2018

Member

throw is immediately below that line. It was two throws and now is one line since both throws just were assembling a proper typeName.

RubyString str = RubyString.newString(getRuntime(), bytes, UTF8Encoding.INSTANCE);
str.setTaint(isTaint());
return str;
RubyString bytes = runtime.newString("#<");

This comment has been minimized.

@kares

kares Apr 25, 2018

Member

just a q: so the direct ByteList concat wasn't good here, does it get problematic for any reason?
isn't the new way going to be doing a lot of "unnecessary" Java String-> byte[] conversion?

This comment has been minimized.

@enebo

enebo Apr 25, 2018

Member

That is a good question I don't recall offhand. It is possible there was no issue but I learned that these methods are not quite a sensible as you may think. I will try it out quickly.

This comment has been minimized.

@enebo

enebo Apr 25, 2018

Member

So I looked at this and I think I can reduce the cost substantially and use a bytelist but there is a specific reason why it works. First when we have m17n data getting combined into a string we cannot use bytelist since bytelist has no idea how to combine two different encoded things together. In this very particular case we only have 7bit ascii strings and something else. So I think I can just mark the new string as same encoding as the something else and it will work....almost anyways. If the encoding used is not isAsciicompat we will generate garbage. OTOH my current code was broken in the same way so we are no worse off.

This comment has been minimized.

@enebo

enebo Apr 25, 2018

Member

Addendum since I am thinking about this. I should try the last broken part here. I suspect the types() method in RubyStringBuilder is actually what the non-ascii name will display like. So I need to try that.

* Make an instance variable out of this symbol (e.g. :foo will generate :foo=).
* @return the new symbol
*/
public RubySymbol asAccessor() {

This comment has been minimized.

@kares

kares Apr 25, 2018

Member

did you mean asWriter (not having asReader seems as just a coincidence since its the same symbol) ?

This comment has been minimized.

@enebo

enebo Apr 25, 2018

Member

Ah nice. You're right I should have called this asWriter(). I will change that.

@Override
public final String toString() {
return symbol;

This comment has been minimized.

@kares

kares Apr 25, 2018

Member

this could use an explanation (comment) on why not simply return symbol; ...for mere mortals :)

This comment has been minimized.

@enebo

enebo Apr 25, 2018

Member

Ah yeah I will add an explanatory comment. Largely toString is doing it's best to make a j.l.String which displays properly but is not perfect since toString() should not be used internally for anything but debugging. Going full RubyString + decode seemed unnecesary here. Perhaps I should also add asJavaString() or using str() is more appropriate if you want to work with a string version of a RubySymbol?

This comment has been minimized.

@kares

kares Apr 25, 2018

Member

overriding asJavaString explicitly (from IRubyObject) sounds good ... for the full decoding cycle (unless its already done so).

This comment has been minimized.

@enebo

enebo Apr 25, 2018

Member

I also just found a few consumers of toString and changed them to idString or asJavaString depending on whether their context was used in an identifier or non-identifier way (for RubySymbol). I will likely be trying to audit more general toString() usage as we seem to be a bit inconsistent.

This comment has been minimized.

@enebo

enebo Apr 25, 2018

Member

@kares oh I should point out asJavaString cannot be decoded version of the method since there are still over 100 users using it to lookup methods. We have this pattern:

    public static IRubyObject get_inner_class(final ThreadContext context,
        final RubyModule self, final IRubyObject name) { // const_missing delegate
        final String constName = name.asJavaString();

All our internal tables for identifiers are keyed off of 8859_1. I think we may need to make a slightly more restrictive but helpful method to replace this pattern and ween off of asJavaString (idString on RubySymbol is more correct but only on RubySymbol -- probably need something on RubyString as well). This though need not be resolved immediately. Just letting you know why asJavaString cannot decode today.

As an aside it is a little weird IRubyObject has asJavaString since this means I could pass a fixnum in here and it would end up as "5" or something. From Ruby side of things I think people would be surprised that would work even if there was a "5" method.

public static String str(Ruby runtime, IRubyObject value, String message) {
RubyString buf = (RubyString) value.asString().dup();
buf.cat19(runtime.newString(message));

This comment has been minimized.

@kares

kares Apr 25, 2018

Member

do we still need to be using 19 legacy named methods? maybe cat + cat19 are worth some 'unification' ...

This comment has been minimized.

@enebo

enebo Apr 25, 2018

Member

Unfortunately, they are very different and both should always exist. I would agree with you that we should refactor with better names as cat19 means nothing but I suspect cat19 is still used by multiple native extensions which use m17n strings in any way (although we can definitely come up with a new name and deprecate). Main difference is one could care less about what it is cat'ing and the other realizes it needs encoding negotiation.

This comment has been minimized.

@kares

kares Apr 25, 2018

Member

ah yeah I recall them being impossible to unify now ... thanks

@enebo enebo merged commit 52a1832 into master Apr 25, 2018

0 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
continuous-integration/travis-ci/push The Travis CI build is in progress
Details

@enebo enebo added this to the JRuby 9.2.0.0 milestone May 24, 2018

@enebo enebo deleted the bytelist_love branch Nov 9, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment