Across the board improvements #10

Merged
merged 24 commits into gutfeeling:master on Sep 4, 2020
Conversation

tomaarsen (Contributor) commented Sep 3, 2020

Hello Dibya,

I've been looking for projects exactly like this one, that do something concise and interesting with text. The fact that this project was shown to have a few small flaws made it even more appealing to me. Over the past few days I've played around with your work, and found some ways to improve its results.
This kind of work is very interesting to me, and maybe we can learn a thing or two with the work I did.

I'll give a short overview of what I've done before diving into details.

Improvement Overview

    i. Added a testing suite to track results. Run it with python test_word_forms.py. (A minimal test-case sketch appears at the end of this overview.)
    ii. Fixed strange case sensitivity for some nouns, e.g. the old output:
get_word_forms("death") => {
 'n': {'death', 'dying', 'dice', 'Death', 'dyings', 'die', 'deaths', 'Deaths'},
      ...
}
    iii. Resolved some of the incorrect pluralisation provided by inflect: no more "politicss".
    iv. Resolved some issues with weird unrelated results, such as:
get_word_forms("verb") => {
  'a': {'verbal'},
  'n': {'wordings', 'wording', 'word', 'verbs', 'verb', 'words'},
  'r': {'verbally'},
  'v': {'verbified',
        'verbifies',
        'verbify',
        'verbifying',
        'word',
        'worded',
        'wording',
        'words'}
}

This is now:

get_word_forms("verb") => {
  "a": {"verbal"},
  "n": {"verbs", "verb"},
  "r": {"verbally"},
  "v": {"verbifying", "verbified", "verbify", "verbifies"},
}
    v. Resolved words like "ran", "am" and "was" not returning any values. The old system returns:
get_word_forms("ran") => {'n': set(), 'a': set(), 'v': set(), 'r': set()}
get_word_forms("run") => {
  'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'}, 
  'a': {'running', 'runny'}, 
  'v': {'running', 'run', 'runs', 'ran'}, 
  'r': set()
}

This is now:

get_word_forms("ran") => {
  'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'}, 
  'a': {'running', 'runny'}, 
  'v': {'running', 'run', 'runs', 'ran'}, 
  'r': set()
}
get_word_forms("run") => {
  'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'}, 
  'a': {'running', 'runny'}, 
  'v': {'running', 'run', 'runs', 'ran'}, 
  'r': set()
}
    vi. Heavily improved the performance of importing the module. The following program
from time import time
t = time()
from word_forms.word_forms import get_word_forms
print(f"It took {time() - t:.4f}s to load the module")

used to output

It took 10.1868s to load the module

and now it outputs

It took 2.7437s to load the module

In addition to these fixes, this pull request solves issues #2 and #3.
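
As mentioned in improvement i, a test suite accompanies these changes. Here is a minimal sketch of the shape of one test case; the actual names and values in test_word_forms.py may differ:

import unittest
from word_forms.word_forms import get_word_forms

class TestWordForms(unittest.TestCase):
    # Illustrative case only: compares one part-of-speech bucket of the
    # returned dict against a manually verified set of forms.
    def test_verb_forms_of_verb(self):
        self.assertEqual(
            get_word_forms("verb")["v"],
            {"verbify", "verbifies", "verbified", "verbifying"},
        )

if __name__ == "__main__":
    unittest.main()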


Detailed changes

I've split up my work into small commits, so that the reasoning and changes in each commit are concise and hopefully easy to follow. I would heavily recommend going through each commit in order to see what I've done, rather than immediately jumping to the overall effect they had on the two main files of your project.

I'll go through each of my commits and tell you my reasoning and what they did.

  • 885445c Added a test suite with a bunch of words and their (expected) results. These were manually edited to remove any errors, such as "politicss".
  • ead7285 Rather than looping over all synsets, and then over all lemmas, to get all words, I now call nltk.corpus.words.words(), which gives (if I recall correctly) some 240,000 words rather than the previous 180,000, in considerably less time. This is responsible for improvement vi.
  • 0f6d9b1 The commit title and description speak for themselves.
  • 4f848e4 The function get_related_lemmas() is more intuitively implemented as a recursive function. I keep track of a list of known lemmas. When called, the function takes the word parameter, gets the lemmas for exactly that word, adds them to known_lemmas if not already present, and recursively calls get_related_lemmas() again with lemmas related to the original word. Each of these recursive calls adds more entries to known_lemmas, which is then returned.
    Note that at this point (this will change in a later commit) the function is functionally identical; only the implementation differs. (A sketch of the recursive shape follows.)
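A minimal sketch of that recursive shape, assuming NLTK's WordNet interface; the helper below is illustrative, not the exact code from the commit:

from nltk.corpus import wordnet as wn

def get_related_lemmas(word, known_lemmas=None):
    # Collect lemmas whose name matches the word exactly, then recurse
    # into lemmas related to them, accumulating results in known_lemmas.
    if known_lemmas is None:
        known_lemmas = []
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            if lemma.name().lower() == word.lower() and lemma not in known_lemmas:
                known_lemmas.append(lemma)
                for related in lemma.derivationally_related_forms() + lemma.pertainyms():
                    get_related_lemmas(related.name(), known_lemmas)
    return known_lemmas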
  • 8a05104 The commit title and description speak for themselves.
  • fe740e6 Slight simplification using .copy() on a list rather than using a list comprehension to copy a list.
  • 2e8c2a9 Moved away from iterating over a nested list, using a dictionary instead. Previously, for each verb you iterated over the entire CONJUGATED_VERB_LIST, and then for each nested list checked whether the verb was in that nested list. This is a very slow operation, and it does not need to be. Now, I use a dict whose keys point to instances of the Verbs class, each holding a set of strings. To illustrate the change, I'll show the old and new representations:
CONJUGATED_VERB_LIST = [
  ['convoluted', 'convoluting', 'convolute', 'convolutes'],
  ['fawn', 'fawned', 'fawns', 'fawning']
]
v1 = Verbs({'convoluted', 'convoluting', 'convolute', 'convolutes'})
v2 = Verbs({'fawn', 'fawned', 'fawns', 'fawning'})
CONJUGATED_VERB_DICT = {
  'convoluted': v1, 
  'convoluting': v1,
  'convolute': v1, 
  'convolutes': v1, 
  'fawn': v2, 
  'fawned': v2, 
  'fawns': v2, 
  'fawning': v2
}

In the old system, we need:

for verb in verb_set:
    for word_list in CONJUGATED_VERB_LIST:
        if verb in word_list:
            for word in word_list:
                related_word_dict["v"].add(word)

You can count the number of nested loops for yourself.
Now we only need:

for verb in verb_set:
    if verb in CONJUGATED_VERB_DICT:
        related_word_dict["v"] |= CONJUGATED_VERB_DICT[verb].verbs

This is considerably faster. Note that |= is an in-place set union.
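
For completeness, the Verbs container in the snippet above can be as small as this sketch, assuming the .verbs attribute used in the loop:

class Verbs:
    # Shared container: every conjugated form in the dict keys points
    # at the same instance, so the set is stored only once.
    def __init__(self, verbs):
        self.verbs = set(verbs)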

  • 30c8282 The changes in line 35 are reverted later on. Furthermore, every pertainym is now used in ADJECTIVE_TO_ADVERB, rather than just the first one. Other than that, word_forms is just slightly optimized to not need a list comprehension.
  • a9cd983 This is a very interesting commit. It uses something very similar to the difflib.get_close_matches() used in constants.py: difflib.SequenceMatcher. New lemmas found in get_related_lemmas are now only considered if they are deemed at least 40% similar to the previous lemma. This avoids jumps like "verbal" -> "word" and "genetic" -> "origin". (A sketch of this gate follows.)
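A sketch of that similarity gate; the threshold value is from the description above, the helper name is illustrative:

from difflib import SequenceMatcher

def similar_enough(previous_lemma, candidate, threshold=0.4):
    # Keep a candidate lemma only if it is at least 40% similar
    # to the lemma we came from.
    return SequenceMatcher(None, previous_lemma, candidate).ratio() >= threshold

similar_enough("verbal", "word")    # False: the jump to "word" is rejected
similar_enough("verb", "verbally")  # True: kept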
  • 7f6d5d2 The commit title and description speak for themselves.
  • 4bd4cd5 The commit title and description speak for themselves.
  • d1c6340 The commit title and description speak for themselves. This is responsible for improvement ii.
  • b2a9daf Rather than blindly adding the plural of any noun, we ensure that the pluralised noun does not end with "css", to avoid "geneticss" and "politicss". This check is refined later in commit 2ea150e.
  • 80d27bf The commit title and description speak for themselves.
  • 5f31968 The commit title and description speak for themselves.
  • 77b9c25 The commit title and description speak for themselves.
  • 0cf0628 The commit title and description speak for themselves.
  • 359fc0c Rather than turning words passed to get_word_forms into their singular form, we now use the NLTK WordNetLemmatizer() to lemmatize the input word as a noun, verb, adjective and adverb. With this change, we get {"run"} as words when the input was "ran", or {"be", "am"} if we input "am". Then, for each of these words, we call get_related_lemmas, and duplicate lemmas are removed. This is responsible for improvement v. (A sketch follows.)
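Roughly, the lemmatization step looks like this sketch, using WordNet's POS tags "n", "v", "a", "r"; the exact integration in the commit differs:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# "ran" only resolves to "run" when lemmatized as a verb; taking the
# union of results across every POS is what yields {"be", "am"} for "am".
words = {lemmatizer.lemmatize("ran", pos) for pos in ("n", "v", "a", "r")}
print(words)  # {'ran', 'run'}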
  • 18eeba6 The commit title and description speak for themselves.
  • 2ea150e Improves on commit b2a9daf. Uses a simple regular expression to detect whether the pluralised noun ends with a consonant followed by "ss". The commit title and description speak for themselves. This is responsible for improvement iii. (A sketch of the check follows.)
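A sketch of that check; the pattern is written out for clarity and the commit's exact expression may differ:

import re

# A consonant followed by "ss" at the end of the plural, e.g. "politicss",
# is treated as an inflect artifact when the original noun lacks it.
SUSPICIOUS_PLURAL = re.compile(r"[b-df-hj-np-tv-z]ss$")

def keep_plural(noun, plural):
    return not (SUSPICIOUS_PLURAL.search(plural) and not SUSPICIOUS_PLURAL.search(noun))

keep_plural("politics", "politicss")  # False: dropped
keep_plural("glass", "glasses")       # True: kept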
  • c832b6b The commit title and description speak for themselves.
  • 384f298 The commit title and description speak for themselves.
  • e71f64f Now that the program output has been improved, some of the examples are outdated. I've updated them, and added the now-relevant examples of "am" and "ran". The Contribution section has also been updated to reflect that a bug has now been fixed.
  • d11297e The commit title and description speak for themselves.

Tests

The modified program passes all provided tests. The original program will fail 9 of them. The output of the tests on the original program is provided here:
original_test_output.txt

I've tested my changes with Python 3.7.4. Whether the program is still Python 2 compatible I do not know.

Potential additions/considerations for you

  • Add tests to setup.py so they can be run via setup.py (e.g. python setup.py test).
  • Check/confirm Python 2 compatibility.
  • Consider adding a contributors section to the README.
  • Consider checking the README for any remaining inconsistencies with the new version, e.g. warnings about bugs that no longer exist.
  • Note that the pluralisation by inflect can still cause issues, as it comes up with words like "runninesses" as the plural of "runniness".

I've had a lot of fun messing with this project, but I recognise there are a lot of changes proposed in this pull request. If you feel that the spirit of the original version would be lost in some way if this pull request were accepted, then I will turn my version into a standalone fork so people can use it if they'd like, with this project preserved as it is.

Let me know if you need anything else from me.

Tom Aarsen

Also includes some tests that currently fail, but are not run.
words.words() returns more words than the previous method, and is significantly faster to fetch.
Uses any() with a generator for improved efficiency
get_related_lemmas is now a recursive implementation
singular_noun(noun) returns False if the noun is already singular. We can use this to avoid having to check ALL_WORDNET_WORDS and accidentally returning a boolean.
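For reference, that inflect behaviour in isolation (a small illustration, not code from the diff):

import inflect

engine = inflect.engine()
engine.singular_noun("dogs")  # 'dog'
engine.singular_noun("dog")   # False: already singular, so skip the lookup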
We no longer loop over the entire list of conjugated verb lists, checking for each verb whether it is inside one of these smaller lists. Instead, the conjugated verb list is now a dictionary: each conjugated verb is used as a dictionary key pointing to the same Verbs object, which stores the small set of related conjugated verbs.
ADJECTIVE_TO_ADVERB default dictionary simplified to remove cases where the key and value are the same, as these don't end up doing anything for us.
Pertainyms are now stored in a set to avoid unnecessary duplicates.
Now each of the pertainyms is added to the dict, rather than just the first.
…to the original lemma

This avoids sudden jumps from "verbal" -> "word" or "genetic" -> "origin"
Using the black formatter
The two remaining cases in failed_test_values are left untested, because no attempt has been made yet to fix the bugs that cause them.
Avoid e.g. "President" in addition to "president", same with "Death".
"politics" and "genetics" are given an extra "s" when passed through `inflect.engine().plural_noun()`. We want to avoid this "css", which I think cannot naturally occur.
Note that these tests are generated using the program itself, and then manually checked whether they seem accurate. This means that the tests are for confirming that the program does not add entries that definitely should not be in there, and not necessarily for getting *all* entries it should get.
This dict should contain these items with the same key and value, unlike what I previously thought
There are some cases of strange nouns, e.g. "littlenesses".
…tion

This lemmatizing is done for nouns, adjectives, verbs and adverbs, and related lemmas for all of these lemmas are taken.
With these changes, "am" will lemmatize into "am" and "be", which will then both be passed into the rest of the system.
Some of these test cases would fail without the previous change.
Rather than just checking for whether the plural ends in "css", we check whether the plural ends in a consonant followed by "ss", while the original noun did not.
This causes "politicss", "geneticss" or "organisationss" to be dropped.
Error 1 in the README is also mostly resolved, it would appear.
Also added the relevant examples for "am" and "ran", and updated Contributions to remove a now-fixed bug.
Call the recursive version that fills `related_lemmas` instead
gutfeeling (Owner)

@cubiedev

Hi Tom,

It is humbling to see that you have taken such an interest in my package and put so much effort into improving it.

I went through your comments and agree with a lot of what you have done. To be honest, I don't have the bandwidth to go through the commits and review each of them. So here's what I am going to do: I am going to merge in your commits without reviewing and trust that you know what you are doing. I am going to release a new major version 2.0.0 to reflect these changes. I think it's better to merge in exciting new changes fast than to make good work (and the people who can potentially benefit from it) wait for some stupid bureaucracy (reviews and my opinions). After all, this is not corporate software ;-)

I am also gonna include you prominently as a contributor in the README.

Let me know if I can support you in some other way.

Big thanks!

  • Dibya

@gutfeeling gutfeeling merged commit b96c14a into gutfeeling:master Sep 4, 2020
tomaarsen (Contributor, Author)

@gutfeeling

Hello Dibya,

I'm happy to hear it! And I wouldn't expect you to spend time to go through each of the commits, don't worry. I'm glad we were able to merge this!
Thank you for the mention as a Contributor in the README and Version update, I appreciate it!

This has been really fun.

Tom Aarsen

@chrisjbryant

Great to see such a big upgrade! I've been following this library since the start. :-)

I just wanted to mention that I noticed you use difflib.SequenceMatcher for character similarity, when the python-Levenshtein library is MUCH faster. I know this because someone made the same suggestion in one of my own libraries, and I'd definitely recommend it!
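
For illustration, the two give scores on the same 0-1 scale, with python-Levenshtein doing the work in C; this is a sketch of the comparison, not a drop-in patch, and the two scores can differ slightly on some inputs:

from difflib import SequenceMatcher
import Levenshtein  # pip install python-Levenshtein

a, b = "verbally", "verbal"
print(SequenceMatcher(None, a, b).ratio())  # 0.857...
print(Levenshtein.ratio(a, b))              # 0.857..., computed much faster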
