Tests look strange #47

Closed
justafucker opened this issue Jun 18, 2015 · 2 comments

Comments

@justafucker

I'm looking at https://github.com/mailgun/talon/blob/master/tests/signature/learning/helpers_test.py

and I don't understand why, in

'Sergey N.  Obukhov <serobnic@xxx.ru>': ['Sergey', 'Obukhov'],

the expected result doesn't include 'serobnic'

@justafucker
Author

What is also unclear to me is that in https://github.com/afedosenko/talon/blob/master/tests/signature/learning/featurespace_test.py

s = '''John Doe
VP Research and Development, Xxxx Xxxx Xxxxx
555-226-2345
john@example.com'''
sender = 'John <john@example.com>'
features = fs.features(sender)
result = fs.apply_features(s, features)
# note that we don't consider the first line because signatures don't
# usually take all the text, empty lines are not considered
eq_(result, [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
             [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

the last line contains 'john', which means the last '0' in the final row should be '1'

@justafucker justafucker changed the title Helper tests look strange Tests look strange Jun 18, 2015
@obukhov-sergey
Member

Hi @justafucker. Sorry for the confusion, and thanks for your interest and questions. I'll try to explain both.

The first test checks that ['Sergey', 'Obukhov'] are among the extracted names, not that they are the only ones extracted. For example, if you modify the test and add serobnic to the list, the test will pass as well.

There is a test that specifically checks that given sergey@xxx.ru we'll extract sergey: https://github.com/mailgun/talon/blob/master/tests/signature/learning/helpers_test.py#L103
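To make the "among the extracted names" idea concrete, here is a minimal sketch of that kind of containment check. It is not the actual test code: `missing_names` is a hypothetical helper, and the `extracted` list is an assumed result for illustration only, not output captured from talon.

```python
def missing_names(expected, extracted):
    """Return the expected names that are absent from the extracted ones."""
    return [name for name in expected if name not in extracted]


# Assumed (hypothetical) extraction result for
# 'Sergey N.  Obukhov <serobnic@xxx.ru>'; not real talon output.
extracted = ['Sergey', 'Obukhov', 'serobnic']

# Passes: both expected names are present, extra names are ignored.
assert missing_names(['Sergey', 'Obukhov'], extracted) == []

# Also passes when 'serobnic' is added to the expected list.
assert missing_names(['Sergey', 'Obukhov', 'serobnic'], extracted) == []
```

A check like this only fails when an expected name is missing, which is why leaving 'serobnic' out of the expected list does not break the test.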

But we definitely encourage you to submit a PR if you find tests / code confusing and wish to contribute / improve them.

Regarding your second question: the algorithm looks for lines like "John Doe", "John", or "Doe", i.e. a line should end with an extracted name, or the extracted name should appear as a detached word. This requirement might seem strange with respect to "john@example.com", but in general it helps avoid false positives when an extracted name happens to be a common sequence of characters that might occur inside a line.
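The rule described above can be sketched roughly as follows. This is only an illustration of the idea, not the code talon uses: `line_mentions_name` is a hypothetical function, and the case-insensitive comparison and whitespace-based notion of a "detached word" are assumptions made for the example.

```python
def line_mentions_name(line, names):
    """Return True if the line ends with a name or contains it as a
    whitespace-delimited word (assumed, simplified matching rule)."""
    stripped = line.strip().lower()
    words = stripped.split()
    for name in names:
        needle = name.lower()
        if not needle:
            # Skip empty names; every string "ends with" the empty string.
            continue
        if stripped.endswith(needle) or needle in words:
            return True
    return False


names = ['John']
assert line_mentions_name('John Doe', names)              # detached word
assert line_mentions_name('Regards, John', names)         # line ends with name
assert not line_mentions_name('john@example.com', names)  # part of an address
```

Under this simplified rule "john@example.com" is not counted, because "john" is neither a separate word nor the end of the line, which matches the '0' in the last row of the expected result.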
