Develop text lemmatize function #3257
base: develop
Conversation
Thanks @Sion1225, your new `AutoLemmatizer` class seems useful. Here are a few suggestions:

The new class can actually handle several sentences at once, so you might eventually consider a more informative name for it, for example `TextLemmatizer`. And the name of the `auto_lemmatize()` function could be shortened to just `lemmatize()`.

Also, since the innermost code in the for loop is identical to `WordNetLemmatizer().lemmatize()`, it might be simpler and clearer to call that function explicitly:
```python
for i, word in enumerate(words):
    lemma = WordNetLemmatizer().lemmatize(
        word, self.pos_word_dict.get(pos_tags[i][1], "n")
    )  # word.lower() could be used, but that is a trade-off
    lemma_list.append(lemma)
```
@ekaf As mentioned in the text, this task can take a long time to complete, and it could be processed with a multiprocessing pipeline. If multiprocessing is considered beneficial, I have two ideas:

For method 1: …

For method 2: …

Alternatively, we could use method 1 internally, parsing based on the … What do you think about this implementation? If you think it's unnecessary, that's perfectly fine too. :)
Thanks again @Sion1225, your contributions are very welcome! To accommodate ML, I guess that support for multiprocessing is actually needed, but it could eventually wait for a future PR. What about implementing your function as a method of `WordNetLemmatizer`, instead of defining a new class? It could be advantageous to group all the methods that use `morphy` into one class, as in the following example. I moved the imports inside the function, and simplified the construction of `lemma_list`:
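A minimal sketch of what such a method might look like; the subclass name, the method name `lemmatize_text`, and the Penn-to-WordNet tag mapping here are assumptions, not the PR's actual code:

```python
from nltk.stem import WordNetLemmatizer


class TextWordNetLemmatizer(WordNetLemmatizer):  # subclassed here only for illustration
    def lemmatize_text(self, text):  # hypothetical method name
        """Tokenize, Pos-tag and lemmatize every word in ``text``."""
        # Imports live inside the method, as described above.
        from nltk import pos_tag, word_tokenize

        # Map the first letter of a Penn Treebank tag to a WordNet Pos;
        # unknown tags fall back to noun ("n").
        pos_map = {"J": "a", "N": "n", "V": "v", "R": "r"}
        return [
            self.lemmatize(word, pos_map.get(tag[0], "n"))
            for word, tag in pos_tag(word_tokenize(text))
        ]
```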
It is also possible to shorten the function even further. Some professors don't appreciate this concise style, but it tends to be popular in nltk. In my opinion, concise code is easier to understand, debug and maintain.

For multiprocessing, it might even be preferable to make it a generator, using `yield from` instead of `return`.
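For illustration, a hedged sketch of such a generator variant, under the same assumed names as the sketch above:

```python
def lemmatize_text(self, text):  # method body; generator variant, assumed name
    from nltk import pos_tag, word_tokenize

    pos_map = {"J": "a", "N": "n", "V": "v", "R": "r"}
    # yield from lets callers consume lemmas lazily, which could suit
    # a future multiprocessing pipeline.
    yield from (
        self.lemmatize(word, pos_map.get(tag[0], "n"))
        for word, tag in pos_tag(word_tokenize(text))
    )
```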
Thank you for your comments and appropriate advice @ekaf! So, you are suggesting that we implement this new feature first and leave multiprocessing as a milestone for later? I also contemplated whether implementing it as a method within `WordNetLemmatizer` would be appropriate. Thank you for restructuring the code using a list comprehension to make it more efficient without overhead. Regarding code style, I don't have a strong preference, but I will follow your suggestion and modify it according to the NLTK style. Also, thank you for the concise revisions to the function descriptions in the code! The current example provided is from Don Quijote, but do you have any ideas for additional examples that might be beneficial to include?
If your method does not use any information from wordnets, I believe it would be very misleading to implement it as a method of the `WordNetLemmatizer` class. Unfortunately, `WordNetLemmatizer` is not a subclass of a general `Lemmatizer` class, but creating such a base class would be great, so that `WordNetLemmatizer` and your class could both be subclasses of it. On the other hand, it would be better to find a more informative name than `AutoLemmatizer`.
@Sion1225, there is no reason to worry about your dictionary. But there is a severe weakness in our approach: it just returns the lemmas, while we also need the Pos if we want to use the output for something useful, for example retrieving WordNet senses. One use case I have in mind is lemmatizing the Brown Corpus and comparing our output with the official Brown tags. Or what about the Gloss Corpus, @arademaker? I fear that our performance may not be very impressive, because of known issues with the `lemmatize()` function, but it would be interesting to get some statistics.
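One possible shape for such output, sketched under the same assumptions as the earlier code (not the PR's actual implementation), is to return (lemma, tag) pairs:

```python
def lemmatize_text(self, text):  # method body; returns (lemma, tag) pairs
    from nltk import pos_tag, word_tokenize

    pos_map = {"J": "a", "N": "n", "V": "v", "R": "r"}
    return [
        (self.lemmatize(word, pos_map.get(tag[0], "n")), tag)
        for word, tag in pos_tag(word_tokenize(text))
    ]
```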
@ekaf That's a great method! It's something I hadn't thought of. It's definitely more concise; I'm learning a lot. I will merge it as a sub-method of `WordNetLemmatizer`. @arademaker Our new function internally uses the WordNet `morphy` function.
I have started to measure the accuracy of this approach against the gold standard tags in the whole Brown Corpus, but I still need to tweak the `tag2pos()` function in order to accommodate some non-standard Brown Pos tags, like 'B' for the auxiliary "be", and 'H' for "have". Nevertheless, it is already clear that we'll end up with more than 90% accuracy for the Pos tagging, so it is reasonable to base the lemmatization on the present Pos tagging. However, this does not mean that we should prefer the WordNetLemmatizer's `lemmatize()` function. And I absolutely agree with @arademaker's objection against misusing the WordNet name.
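As an illustration of the kind of tweak described (the helper name and the treatment of 'B'/'H' are assumptions), nltk's `map_tag` already covers the standard Brown tags:

```python
from nltk.tag import map_tag


def brown_tag_to_universal(tag):  # hypothetical helper
    """Map a Brown Corpus tag to a Universal tag, special-casing the
    non-standard 'B*' ("be") and 'H*' ("have") auxiliary tags."""
    if tag.startswith(("B", "H")):
        return "VERB"
    return map_tag("en-brown", "universal", tag)
```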
@Sion1225, this PR should be converted to draft (using the link in the upper right corner), as long as it is "Still in progress" and not ready for merging.
@ekaf
@Sion1225, please note that your old doctests fail, now that the output is tuples that include the tags. This will become important when finalizing this PR, because doctests are required to succeed before merging. But you don't need to worry about it now, because you might also want to use the `tuple2str` function from `nltk.tag` to present the output in a prettier form in the doctest.
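For reference, `tuple2str` joins a (token, tag) tuple into a single string:

```python
from nltk.tag import tuple2str

tuple2str(("books", "NOUN"))  # 'books/NOUN'
```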
Would you consider calling the `pos_tag` function with "universal" as the value of the optional `tagset` parameter? The Universal Tagset is very convenient, because the `nltk.tag` module provides a `map_tag` function that can map many other tagsets to Universal Tags. This is handy for my current evaluation against the Brown tagset, and probably many other uses. So it would be more useful to return the Universal Tag instead of the WordNet pos.
If you want, I could push a commit, replacing `tag2pos` with this helper function:

```python
def universal_tag_to_wn_pos(tag):
    """Convert Universal Tag to WordNet Pos.
    Return None when WordNet does not cover the Pos."""
    return {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}.get(tag, None)
```
@ekaf Yes, it definitely seems that the Universal Tagset is more useful. I would be happy if you could push the commit! Also, as you mentioned, I forgot to change the output example in the method comments. If the output changes again, I think that part will need to be revised as well. Would you please take care of that too?
Ok, @Sion1225, I moved `universal_tag_to_wn_pos()` out of the class, because it can also be useful elsewhere. Maybe it should even go into the `corpora/readers/wordnet.py` module instead?
@ekaf I reviewed the code. Thank you for your modifications and push!
Here is a typical example of my evaluation of the Pos tagging, covering the first three sentences from the Brown Corpus: out-brown.txt. These examples are typical, because the Pos differences are often between NOUN and ADJ.
Thank you @ekaf for the examples; they helped me understand. I realised this is because some words can have multiple parts of speech, such as being both a noun and an adjective.
Yes @Sion1225, the problem is that the most frequent words are often ambiguous. The `wn._morphy()` function returns a list with all matches, but among those, `lemmatize()` picks the shortest word. I don't mind if you prefer to present tuples in the docstring, just feel free to make any adjustment you wish.
I modified the comments, and I think it is now ready for the pull request. And @ekaf, I thought about …
Thanks @Sion1225! Yes, it seems ready now.
I have been working with natural language processing and often needed to know which words were used in certain corpora. Many dictionaries consist of word stems, which requires extracting stems from sentences. For example, specific words can be key clues or carry important information, so one may need to extract the sentences that use these words, or to process sentence information with them.

In this context, I have developed a class called `AutoLemmatizer` in `stem/wordnet.py` that automatically performs tokenization and part-of-speech-based lemmatization, returning the lemmas of all words used in a sentence.
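A hypothetical usage sketch, based only on the names mentioned in this PR (`AutoLemmatizer` and `auto_lemmatize()`); the exact signature and output are assumptions:

```python
from nltk.stem.wordnet import AutoLemmatizer

lemmatizer = AutoLemmatizer()
lemmatizer.auto_lemmatize("The striped bats were hanging on their feet")
# plausibly: ['The', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot']
```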
I also considered converting "n't" to "not", but have not implemented it yet.

For big data, this process takes a long time, and the task could be handled by a multiprocessing pipeline. If you are interested, please let me know; I am developing this.
I would be pleased if this contribution is considered useful and adopted.
issue: #3256