Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-18040 Provides unicode implementations for excludeNthWord #10269

Merged
merged 1 commit into from Jul 25, 2017

Conversation

Projects
None yet
4 participants
@dskaff
Copy link
Contributor

commented Jul 19, 2017

  • Adds interface definition to ecllibrary/std/uni.ecl
  • Adds test cases to mirror string version of excludeNthWord
  • Adds code that executes excludeNthWord with unicode capabilities

Signed-off-by: David Skaff David.Skaff@lexisnexisrisk.com

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Testing:

@dskaff

This comment has been minimized.

Copy link
Contributor Author

commented Jul 19, 2017

@ghalliday please code review.

@ghalliday
Copy link
Member

left a comment

@dskaff a good first go. There are various comments from minor to potentially more significant.

If the effect of this change is to highlight ways of improving getNthWord as well, then that will be useful - particularly since that function is more likely to be used.

//Test latin diacritical marks
EXPORT Test26 := ASSERT(Uni.ExcludeNthWord(U'À à',2)+U'!' = U'À !');
EXPORT Test27 := ASSERT(Uni.ExcludeNthWord(U'ä̰́ Ä̰́',2)+U'!' = U'ä̰́ !');
//Test Chinese. "I am a computer" --> "I am"

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 20, 2017

Member

A good set of tests. I would suggest a couple more:
commas and other punctuation: 'abc, def, ghi, jhi' 'abc := name'
diacritics within words

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 20, 2017

Member

Having looked at the string version I see that it only treats characters < 0x20 as break characters. This is probably a point where the string version and the unicode version diverge. @johnholt what would you expect?

//Test Chinese. "I am a computer" --> "I am"
EXPORT Test28 := ASSERT(Uni.ExcludeNthWord(U'我是電腦',2)+U'!' = U'我是!');
/*
//arabic lettering with accent marks. Bidirectional.

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 20, 2017

Member

You should add a comment to explain why these tests are commented out.

@@ -0,0 +1,65 @@
/*##############################################################################

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 20, 2017

Member

whitespace: this line and the other blank lines in the comment have trailing spaces. (git diff HEAD^ will highlight them)

@@ -0,0 +1,65 @@
/*##############################################################################

HPCC SYSTEMS software Copyright (C) 2012 HPCC Systems®.

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 20, 2017

Member

minor: new files should update the date

* Lacks bidirectional capability.
*
* @param text The string to be broken into words.
* @param n Which word should be returned from the function.

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 20, 2017

Member

removed from the string

@@ -718,6 +719,48 @@ unsigned unicodeEditDistanceV4(UnicodeString & left, UnicodeString & right, unsi
return da[mask(leftLen-1)][rightLen-1];
}

UnicodeString excludeNthWord(RuleBasedBreakIterator& bi, UnicodeString const & source, unsigned n)
{
UnicodeString processed;

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 20, 2017

Member

Efficiency: Think about the number of UnicodeString that are created - each one will clone the string on the heap. I think 3 UnicodeStrings are created, and it is currently likely to clone the string 2 or 3 times, when 1 could be done instead.

wordBeginning = 0;
}
else {
wordBeginning = bi.preceding(idx);

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 20, 2017

Member

Efficiency: Can you avoid doing this by saving values from previous iterations?

This comment has been minimized.

Copy link
@dskaff

dskaff Jul 20, 2017

Author Contributor

By this do you mean set wordBeginning's value to the idx before the nth word? @ghalliday

{
UErrorCode status = U_ZERO_ERROR;
Locale locale(localename);
RuleBasedBreakIterator* bi = (RuleBasedBreakIterator*)RuleBasedBreakIterator::createWordInstance(locale, status);

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 20, 2017

Member

Efficiency: Note this section in the ICU documentation: http://userguide.icu-project.org/boundaryanalysis
"Reuse

It is slow and wasteful to repeatedly create and destroy a BreakIterator when it is not necessary. For example, do not create a separate BreakIterator for each line in a document that is being word-wrapped. Keep around a single instance of a line BreakIterator and use it whenever a line break iterator is needed.
"
I know you based this on an existing function. That was written 6 years ago before we had a good code review system in place. I think it might be best to leave it as it is, but fix it with a separate pull request.

UnicodeString uText(text, textLen);
UnicodeString processed = excludeNthWord(*bi, uText, n);
delete bi;
if(processed.length()>0)

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 20, 2017

Member

indent: extra indentation (because of a tab character)

tgt = 0;
}
}

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 20, 2017

Member

whitespace: extra whitespace at the end of the file. Both githooks and "git diff HEAD^ --check" can be used to find the problem.

@dskaff

This comment has been minimized.

Copy link
Contributor Author

commented Jul 20, 2017

@ghalliday please code review the latest commit.

@ghalliday
Copy link
Member

left a comment

@dskaff looking good. The test cases are very good.

A few minor layout issues, and a couple of potential efficiency improvements.

@@ -382,4 +382,19 @@ EXPORT unsigned4 WordCount(unicode text, varstring localename = '') :=
EXPORT unicode GetNthWord(unicode text, unsigned4 n, varstring localename = '') :=
lib_unicodelib.UnicodeLib.UnicodeLocaleGetNthWord(text, n, localename);

/**
* Returns everything except the nth word and, depending on the nth word, some whitespaces from the string. Words are marked by the unicode break semantics.

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 21, 2017

Member

minor: We generally aim to have lines less than 120 characters. Not always enforced, but it would be worth reformatting this so the long lines were a bit shorter.

int32_t idx = bi.first();
int32_t wordidx = 0;
unsigned wordBeginning = 0;
while (idx != BreakIterator::DONE) {

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 21, 2017

Member

style: The general style is for curly brackets to be on the following line. The function you started from doesn't follow that style. So while, if, do in this function should be on a new line.

@@ -718,6 +719,43 @@ unsigned unicodeEditDistanceV4(UnicodeString & left, UnicodeString & right, unsi
return da[mask(leftLen-1)][rightLen-1];
}

UnicodeString excludeNthWord(RuleBasedBreakIterator& bi, unsigned textLen, UChar const * text, unsigned n)

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 21, 2017

Member

This is an improvement, but the return is likely to involve an extra clone of the string - unless the c++ compiler is very clever.

You could do the following instead:

void excludeNthWord(RuleBasedBreakIterator& bi, UnicodeString & source, unsigned n)
...
                source.removeBetween(wordBeginning, wordEnd);
                return;
...
    UnicodeString processed(text, textLen);
    excludeNthWord(*bi, processed, n);

}
wordBeginning = idx;
idx = bi.next();
if (idx == BreakIterator::DONE && wordidx < 1) {

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 21, 2017

Member

I think you can simplify the code by moving these two conditions outside the loop.

@dskaff dskaff force-pushed the dskaff:excludeNthWord branch from 1d7a606 to 21dc283 Jul 21, 2017

@dskaff

This comment has been minimized.

Copy link
Contributor Author

commented Jul 21, 2017

@ghalliday Should be ready for merge.

@dskaff dskaff force-pushed the dskaff:excludeNthWord branch from 21dc283 to 62f8ab2 Jul 21, 2017

@dskaff

This comment has been minimized.

Copy link
Contributor Author

commented Jul 21, 2017

Edited commit message to not contain the messages of several previous commits. (my fault for not editing it when merging them together)

@dskaff dskaff closed this Jul 21, 2017

@dskaff dskaff reopened this Jul 21, 2017

@hpcc-jirabot

This comment has been minimized.

Copy link

commented Jul 21, 2017

https://track.hpccsystems.com/browse/HPCC-18040
Jira not updated (state was not active or new)

@ghalliday
Copy link
Member

left a comment

@dskaff 3 very trivial comments, otherwise ready to merge.

bi.setText(source);
int32_t idx = bi.first();
int32_t wordidx = 0;
unsigned wordBeginning;

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 24, 2017

Member

trivial. It would be worth assigning 0 to wordBeginning. The value will never actually be used, but it will avoid some false positives from programs like Coverity which check the correctness of the program - since there is a path through the function where wordBeginning is not initialised (even though the path can never be taken).

{
source.removeBetween(bi.first(), bi.last());
}
return;

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 24, 2017

Member

trivial: No need for a return a the end of a function. Please remove.

UnicodeString processed(text, textLen);
excludeNthWord(*bi, processed, n);
delete bi;
if(processed.length()>0)

This comment has been minimized.

Copy link
@ghalliday

ghalliday Jul 24, 2017

Member

trivial: Normal to have a space between the if and the bracket.

@dskaff dskaff force-pushed the dskaff:excludeNthWord branch from 62f8ab2 to a583bd0 Jul 24, 2017

David Skaff
HPCC-18040 Provides unicode implementations for excludeNthWord
- Adds interface definition to ecllibrary/std/uni.ecl
- Adds test cases to mirror string version but with Unicode capability
- Adds code that executes excludeNthWord with unicode capabilities

Signed-off-by: David Skaff <David.Skaff@lexisnexisrisk.com>

@dskaff dskaff force-pushed the dskaff:excludeNthWord branch from a583bd0 to 5089651 Jul 24, 2017

@HPCCSmoketest

This comment has been minimized.

Copy link
Contributor

commented Jul 24, 2017

Automated Smoketest:
Sha: 5089651
Build: success
Build: success
ECL Watch: Rebuilding Site

errors warnings build time
0 75 43.431 seconds

Install hpccsystems-platform-community_6.5.0-trunk0.el7.x86_64.rpm
HPCC Start: OK

Unit tests result:

Test total passed failed errors timeout
unittest 95 95 0 0 0
wutoolTest(Dali) 19 19 0 0 0
wutoolTest(Cassandra) 19 19 0 0 0

Regression test result:

phase total pass fail
setup (hthor) 11 11 0
setup (thor) 11 11 0
setup (roxie) 11 11 0
test (hthor) 744 744 0
test (thor) 639 639 0
test (roxie) 772 772 0

HPCC Stop: OK
HPCC Uninstall: OK

@ghalliday ghalliday merged commit 440b3aa into hpcc-systems:master Jul 25, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.