New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new unic-ucd-common component #172

Merged
merged 3 commits into from Oct 5, 2017

Conversation

Projects
None yet
3 participants
@behnam
Copy link
Member

behnam commented Oct 2, 2017

There are some core properties that are commonly used in Unicode
algorithsm, as well as in applications directly, and are not specific to
any single area. Examples of these properties are Alphabetic and
White_Space.

Also, there are some resolved properties that are used commonly, like
Numeric and Alphanumeric, which are commonly defined as based on
General_Category and Alphabetic properties. Since they are common in
applications, it makes sense to provided optimized implementations.

This new componet, unic-ucd-common, hosts these properties.

@behnam behnam requested a review from CAD97 Oct 2, 2017

@behnam behnam self-assigned this Oct 2, 2017

@behnam behnam added this to the UNIC-0.7 milestone Oct 2, 2017

@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 2, 2017

This is needed in implementation of the Segmentation algorithm. Tracker: #135

For the resolved properties put here, another option is to put them directly under unic-segment.

Either way, it should be noted that neither of Numeric nor Alphanumeric are explicitly mentioned in the spec, and it's left to the application to decide how to extract words from boundaries. See http://www.unicode.org/reports/tr29/#Word_Boundaries.

@CAD97
Copy link
Collaborator

CAD97 left a comment

Seems like a good addition; alpha/numeric is something that is commonly referenced.

Though in the case of segmentation:

Numeric: http://www.unicode.org/reports/tr29/#SB_Numeric

And I assume the part of the spec you're referencing is

Letters are not the only characters that can be used to determine the “significant” words; different implementations may include other types of characters such as digits or perform other analysis of the characters.

I don't directly see how these are required for segmentation, but I still think they're worthwhile to expose here.

@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 2, 2017

Numeric: http://www.unicode.org/reports/tr29/#SB_Numeric

That's SB::Numeric, as in Sentence Break. The place is_alphanumeric() is called is in Word Break (WB).

The current implementation doesn't have SB char property or algorithm.

But, it's a good idea to provide an API that allows application to choose how they want to detect words. A common implementation (and probably with its own function) would be to use is_alphanumeric(), as defined here.

I don't directly see how these are required for segmentation, but I still think they're worthwhile to expose here.

After Word Boundaries are detected, string is split into many pieces, some of which are only white spaces, punctuation, etc. The iterator to return "words" needs to distinguish between useful pieces and useless ones, in this perspective. This is where has_alphanumeric is used on pieces.

@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 2, 2017

Following the lead of the std lib, I think we should put all these 5 functions here:

  • is_alphabetic()
  • is_whitespace()
  • is_alphanumeric()
  • is_control()
  • is_numeric()

UPDATE: The is_digit() function is more complicated than a simple Char Property. Let's skip that one.

This PR adds 3 of them. I can add the rest in another step, or just include here if there's any reason for it.

@CAD97

This comment has been minimized.

Copy link
Collaborator

CAD97 commented Oct 2, 2017

The iterator to return "words" needs to distinguish between useful pieces and useless ones, in this perspective. This is where has_alphanumeric is used on pieces.

Wouldn't this exclude Emoji? I'm pretty sure that what the spec is implying is a check for being whitespace? But then again I've only skimmed over it. I say add in the other two simple components here and then add it.

Add new unic-ucd-common component
There are some core properties that are commonly used in Unicode
algorithsm, as well as in applications directly, and are not specific to
any single area. Examples of these properties are `Alphabetic` and
`White_Space`.

Also, there are some *resolved* properties that are used commonly, like
*Numeric* and *Alphanumeric*, which are commonly defined as based on
*General_Category* and *Alphabetic* properties. Since they are common in
applications, it makes sense to provided optimized implementations.

This new componet, `unic-ucd-common`, hosts these properties.
@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 3, 2017

Okay, I'm adding the other common properties here then before we land.

Wouldn't this exclude Emoji? I'm pretty sure that what the spec is implying is a check for being whitespace? But then again I've only skimmed over it. I say add in the other two simple components here and then add it.

Yeah, that part of the API should definitely be expanded. I'll keep it minimal while I'm importing the existing code, and we will build on top of that. The only thing we need in the API at the moment is to let user choose the filtering for word iterator.

The rest, I think we better keep when we get to string transformation API, which will be in a month or two.

@behnam behnam force-pushed the ucd-common branch from 8c7f914 to ac3365c Oct 3, 2017

[ucd/common] Add White_Space and is_control
Add UCD `White_Space` property.

Add `is_control()` method following the Rust std lib practice for
defining it based on `General_Category = Cc`.

@behnam behnam force-pushed the ucd-common branch from 2456352 to c36ba8e Oct 4, 2017

@CAD97
Copy link
Collaborator

CAD97 left a comment

Looks ready to me. Just one extremely minor stylistic thing inline.

macro_rules! assert_char {
($ch:expr, $x:expr => $y:expr) => (
assert!(
!$x || $y,

This comment has been minimized.

@CAD97

CAD97 Oct 4, 2017

Collaborator

These are logical equivalents, but given the below textual assertion, I think this would read better as !($x && !$y). I think that is closer to the actual assertion here: that if we have $x then $y must be true. Maybe, it's better written explicitly: if $x { assert!( $y, ... ) }

But this is a stylistic choice. I'll leave it up to you.

This comment has been minimized.

@behnam

behnam Oct 4, 2017

Member

Thanks, @CAD97, for the comments!

The reason the precondition is inside the assertion macro to be able to write out a meaningful error on failure. First I was only passing in $y to the macro, then realized that the text needs to know about the precondition, and it doesn't make sense to repeat it on each macro call-site.

About changing the logical form, IMHO the current form is more common and easier to read. The double-negation on the other form makes it harder to follow. (I think it's similar to the case we had last week, where we went for the simpler logical formula.)

There's only one thing that I want to change here, though, which is =>, which is a valid operator for the expressions in this context, and therefore, I want to update it to ==>, -> or -->, which cannot be part of the expressions.

[ucd/common] Add consistency test
Add new integration test to check, for all code-points, consistency
between common properties themselves, and versus General_Category.

@behnam behnam force-pushed the ucd-common branch from c36ba8e to 652f4c0 Oct 4, 2017

@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 4, 2017

Okay, updated the macro be expand the support, make the syntax more readable by using if, and remove yet more duplications and improve runtime perf by only checking preconditions one time.

@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 4, 2017

Btw, later we should move this to the utils component, to use in various consistency tests we have. Many of them are too verbose at the moment because of not using something like this.

@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 4, 2017

bors: r+

@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 5, 2017

bors: ping

@bors

This comment has been minimized.

Copy link
Contributor

bors bot commented Oct 5, 2017

pong

@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 5, 2017

bors: status?

@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 5, 2017

bors: status

@CAD97

This comment has been minimized.

Copy link
Collaborator

CAD97 commented Oct 5, 2017

It says we're waiting in queue on my side (through the GitHub bit at the bottom)

@CAD97

This comment has been minimized.

Copy link
Collaborator

CAD97 commented Oct 5, 2017

10/4/2017, 5:44:27 PM Crash batch {:timeout, {GenServer, :call, [BorsNG.GitHub, {:get_file, {{:installation, 45082}, 94862886}, {"staging.tmp", ".github/bors.toml"}}, 5000]}}

(source: https://app.bors.tech/repositories/871/log)

@CAD97

CAD97 approved these changes Oct 5, 2017

Copy link
Collaborator

CAD97 left a comment

bors: r+

(retry?)
EDIT: Status page now says running

bors bot added a commit that referenced this pull request Oct 5, 2017

Merge #172
172: Add new unic-ucd-common component r=CAD97 a=behnam

There are some core properties that are commonly used in Unicode
algorithsm, as well as in applications directly, and are not specific to
any single area. Examples of these properties are `Alphabetic` and
`White_Space`.

Also, there are some *resolved* properties that are used commonly, like
*Numeric* and *Alphanumeric*, which are commonly defined as based on
*General_Category* and *Alphabetic* properties. Since they are common in
applications, it makes sense to provided optimized implementations.

This new componet, `unic-ucd-common`, hosts these properties.
@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 5, 2017

Hum, where did you get that log, @CAD97?

The thing is that it shows the PR under "Awaiting review" on the bors dashboard, which is not correct. And looks like there's no command to query the latest status from it.

@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 5, 2017

bors: retry

@CAD97

This comment has been minimized.

Copy link
Collaborator

CAD97 commented Oct 5, 2017

Syntax Description
bors r+ Run the test suite, and push to master if it passes. Short for “reviewed: looks good.”
bors r= Same as r+, but the “reviewer” in the commit log will be recorded as the user(s) given as the argument.
bors r- Cancel an r+ or r=.
bors try Run the test suite, without pushing to master.
bors delegate+ Allow the pull request author to r+ their changes.
bors delegate= Allow the listed users to r+ this pull request’s changes.
bors ping Will respond if bors is set up.

(source: https://bors.tech/documentation/reference/)

So no retry/status command. Just re-r-ing it.

Status came from https://app.bors.tech/repositories/871/log.

@bors

This comment has been minimized.

Copy link
Contributor

bors bot commented Oct 5, 2017

Not awaiting review

@bors

This comment has been minimized.

Copy link
Contributor

bors bot commented Oct 5, 2017

✌️ behnam can now approve this pull request

@bors

This comment has been minimized.

Copy link
Contributor

bors bot commented Oct 5, 2017

Not awaiting review

@bors

This comment has been minimized.

Copy link
Contributor

bors bot commented Oct 5, 2017

Canceled

bors bot added a commit that referenced this pull request Oct 5, 2017

@CAD97

This comment has been minimized.

Copy link
Collaborator

CAD97 commented Oct 5, 2017

..... Bors I did not say that

EDIT: oh... the table confused bors. A lot. 😆

bors: r+

bors bot added a commit that referenced this pull request Oct 5, 2017

Merge #172
172: Add new unic-ucd-common component r=CAD97 a=behnam

There are some core properties that are commonly used in Unicode
algorithsm, as well as in applications directly, and are not specific to
any single area. Examples of these properties are `Alphabetic` and
`White_Space`.

Also, there are some *resolved* properties that are used commonly, like
*Numeric* and *Alphanumeric*, which are commonly defined as based on
*General_Category* and *Alphabetic* properties. Since they are common in
applications, it makes sense to provided optimized implementations.

This new componet, `unic-ucd-common`, hosts these properties.
@CAD97

This comment has been minimized.

Copy link
Collaborator

CAD97 commented Oct 5, 2017

(This is a hot mess, sorry...)

@behnam

This comment has been minimized.

Copy link
Member

behnam commented Oct 5, 2017

No worries. Looks like "It Who Cannot Be Named" is back to work, at least.

@bors

This comment has been minimized.

Copy link
Contributor

bors bot commented Oct 5, 2017

@bors

This comment has been minimized.

Copy link
Contributor

bors bot commented Oct 5, 2017

@bors bors bot merged commit 652f4c0 into master Oct 5, 2017

5 checks passed

bors Build succeeded
continuous-integration/appveyor/branch AppVeyor build succeeded
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
@notriddle

This comment has been minimized.

Copy link

notriddle commented Oct 5, 2017

I updated the reference to describe how to not do that. Thanks for helping me get the docs into shape! 😝

@behnam behnam deleted the ucd-common branch Oct 5, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment