
Develop text lemmatize function #3257

Open
Sion1225 wants to merge 8 commits into develop

Conversation


@Sion1225 Sion1225 commented May 21, 2024

I have been working with natural language processing and often needed to know which words were used in certain corpora. Many dictionaries consist of word stems, so the stems have to be extracted from sentences. For example, specific words can be key clues or carry important information, which makes it necessary to extract the sentences that use those words, or to process sentence information with them.

In this context, I have developed a class called AutoLemmatizer in stem/wordnet.py that automatically performs tokenization and part-of-speech-based lemmatization, returning the lemmas of all words used in a sentence.

  • I also considered converting 'n't' to 'not,' but have not implemented it yet.

  • For big data, this process takes a long time, and the task could be handled by a multiprocessing pipeline. If you are interested, please let me know; I am developing this.

I would be pleased if this contribution is considered useful and adopted.

>>> from nltk.stem import AutoLemmatizer
>>> auto_wnl = AutoLemmatizer()
>>> print(auto_wnl.auto_lemmatize('Proverbs are short sentences drawn from long experience.'))
['Proverbs', 'be', 'short', 'sentence', 'draw', 'from', 'long', 'experience', '.']
>>> print(auto_wnl.auto_lemmatize('proverbs are short sentences drawn from long experience.'))
['proverb', 'be', 'short', 'sentence', 'draw', 'from', 'long', 'experience', '.']

issue: #3256

@Sion1225 Sion1225 changed the title Developauto lemmatize function Develop auto lemmatize function May 21, 2024
Contributor

@ekaf ekaf left a comment


Thanks @Sion1225, your new AutoLemmatizer class seems useful. Here are a few suggestions:

The new class can actually handle several sentences at once, so you might eventually consider a more informative name for it, for example TextLemmatizer. And the name of the auto_lemmatize() function could be shortened to just lemmatize().

Also, since the innermost code in the for loop is identical to WordNetLemmatizer().lemmatize(), it might be simpler and clearer to call that function explicitly.

        for i, word in enumerate(words):
            lemma = WordNetLemmatizer().lemmatize(
                word, self.pos_word_dict.get(pos_tags[i][1], "n")
            )  # word.lower() can be used but it is trade-off problem
            lemma_list.append(lemma)

@Sion1225
Author

Sion1225 commented May 22, 2024

@ekaf
Thank you for giving me the chance to contribute!
I have pushed the changes incorporating your advice. Thank you again for the advice.

As mentioned in the PR description, this task can take a long time to complete and could be processed with a multiprocessing pipeline.
What do you think about applying multiprocessing or multithreading?
It might also be better to leave such multiprocessing decisions to the user's discretion (i.e., not implement it here).


If multiprocessing is considered beneficial, I have two ideas:

  1. For a 2D list where strings are split on line breaks, run the three tasks (tokenize, pos_tag, lemmatize) as a pipeline, so that as soon as one sentence finishes a stage it moves on to the next stage.

  2. For a 1D list as it is now, pass each processed word on to the next stage one by one as it is produced during tokenization.

For method 1:

  • Advantages: It can be applied immediately to datasets that are already split into sentences. Even for datasets that are not split by sentence, parsing on \n is easy.
  • Disadvantages: Accepting a 2D list makes the API less intuitive to use.

For method 2:

  • Advantages: Maintains the intuitive usage as it is now.
  • Disadvantages: Because of frequent inter-thread communication overhead, I cannot predict whether the actual execution time would be reduced significantly. It also requires datasets that are already split into sentences to be joined back into a single string.

Alternatively, we could use method 1 internally by parsing based on the \n symbol.
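Just to make method 1 concrete, here is a very rough sketch (the helper names and the pool size are only placeholders, not a final design):

    from multiprocessing import Pool

    from nltk.stem import WordNetLemmatizer
    from nltk.tag import pos_tag
    from nltk.tokenize import word_tokenize

    pos_word_dict = {"N": "n", "V": "v", "J": "a", "R": "r"}
    wnl = WordNetLemmatizer()

    def lemmatize_sentence(sentence):
        # Each sentence passes through all three stages (tokenize,
        # pos_tag, lemmatize) independently of the other sentences.
        return [
            wnl.lemmatize(word, pos_word_dict.get(tag[0], "n"))
            for word, tag in pos_tag(word_tokenize(sentence))
        ]

    def lemmatize_lines(text, processes=4):
        # Method 1: split the input on line breaks and let a pool of
        # workers lemmatize whole sentences in parallel.
        with Pool(processes) as pool:
            return pool.map(lemmatize_sentence, text.split("\n"))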

What do you think about this implementation? If you think it's unnecessary, that's perfectly fine too. :)

@ekaf
Contributor

ekaf commented May 23, 2024

Thanks again @Sion1225, your contributions are very welcome!

To accommodate ML, I guess that support for multiprocessing is actually needed, but it could eventually wait for a future PR.

What about implementing your function as a method of WordNetLemmatizer, instead of defining a new class? It could be advantageous to group all the methods that use morphy into one class, like in the following example. I moved the imports inside the function, and simplified the construction of the lemma_list:

    def lemmatize_text(self, text: str) -> List:
        """
        Tokenize input text, estimate the part-of-speech tag of each word,
        and return a list of lemmas.

        Returns each input word unchanged when it cannot be found in WordNet.

        :param text: The input text to lemmatize.
        :type text: str
        :return: A list with the estimated lemma of each `word` in the input text.

        >>> from nltk.stem import WordNetLemmatizer
        >>> wntl = WordNetLemmatizer().lemmatize_text
        >>> print(wntl('Proverbs are short sentences drawn from long experience.'))
        ['Proverbs', 'be', 'short', 'sentence', 'draw', 'from', 'long', 'experience', '.']
        >>> print(wntl('proverbs are short sentences drawn from long experience.'))
        ['proverb', 'be', 'short', 'sentence', 'draw', 'from', 'long', 'experience', '.']

        """
        from nltk.tag import pos_tag
        from nltk.tokenize import word_tokenize

        # Tokenize the text, and POS tag each word
        pos_tags = pos_tag(word_tokenize(text))

        return [
            # Lemmatize each POS-tagged word
            self.lemmatize(pos_tag[0], self.pos_word_dict.get(pos_tag[1], "n"))
            for pos_tag in pos_tags
        ]

@ekaf
Contributor

ekaf commented May 23, 2024

It is also possible to shorten the function even further. Some professors don't appreciate this concise style, but it tends to be popular in nltk. In my opinion, concise code is easier to understand, debug and maintain:

    def lemmatize_text(self, text: str) -> List:

        from nltk.tag import pos_tag
        from nltk.tokenize import word_tokenize

        return [
            # Lemmatize each POS-tagged word:
            self.lemmatize(word, self.pos_word_dict.get(tag, "n"))
            # Tokenize the input text and POS tag each word:
            for word, tag in pos_tag(word_tokenize(text))
        ]

For multiprocessing, it might even be preferable to make it a generator, using yield from instead of return.

@Sion1225
Author

Sion1225 commented May 23, 2024

Thank you for your comments and appropriate advice @ekaf !

So, you are suggesting that we implement this new feature first and leave multiprocessing as a milestone for later?

I also contemplated whether implementing it as a method within WordNetLemmatizer or as an independent class would be easier to understand, so I welcome your opinion!
However, for this feature, the pos_word_dict dictionary needs to be declared, and redefining it every time we use this feature would be inefficient. Therefore, I want to declare it as a class variable. But if we implement it as a method within WordNetLemmatizer, the pos_word_dict dictionary would then be declared even when we only use lemmatize and do not need it. Do you think this issue can be ignored, since the pos_word_dict dictionary is relatively small?
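(What I mean is roughly the following; the class name here is only illustrative.)

    class TextLemmatizer:
        # Declared once at class level, shared by all instances,
        # instead of being rebuilt on every call.
        pos_word_dict = {"N": "n", "V": "v", "J": "a", "R": "r"}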


Thank you for restructuring the code using list comprehension to make it more efficient without overhead. Regarding code style, I don't have a strong preference, but I will follow your suggestion and modify it according to the NLTK style.

Also, thank you for the concise revisions to the function descriptions in the code! The current example provided is from Don Quijote, but do you have any ideas for additional examples that might be beneficial to include?

@arademaker

> I also contemplated whether implementing it as a method within WordNetLemmatizer or as an independent class would be easier to understand, so I welcome your opinion!

If your method does not use any information from wordnets, I believe it would be very misleading to implement it as a method of the WordNetLemmatizer class. Unfortunately, WordNetLemmatizer is not a subclass of a general Lemmatizer class, but creating a Lemmatizer base class would be great, so that WordNetLemmatizer and your class could both be subclasses of it. In any case, it would be better to find a more informative name than AutoLemmatizer.

@ekaf
Contributor

ekaf commented May 23, 2024

@Sion1225, there is no reason to worry about your dictionary, because it is equivalent to the following shorter helper function, which can be defined inside the lemmatize_text() function:

    def tag2pos(tag):
        return {"N": "n", "V": "v", "J": "a", "R": "r"}.get(tag[0], "n")

But there is a severe weakness in our approach: it just returns the lemmas, while we also need the pos if we want to use the output for something useful, such as retrieving WordNet senses.
So I suggest we return (lemma, pos) tuples, or possibly even triples that include the tag. Hopefully, that could also alleviate @arademaker's concerns above. Wouldn't it be useful in your specific use case too, where you wrote that you want to look up the lemmas in a dictionary?
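For example, the comprehension could roughly become something like this (only a sketch, with tag2pos() defined inside the function):

    def lemmatize_text(self, text: str) -> List:

        from nltk.tag import pos_tag
        from nltk.tokenize import word_tokenize

        def tag2pos(tag):
            return {"N": "n", "V": "v", "J": "a", "R": "r"}.get(tag[0], "n")

        return [
            # Return (lemma, pos) pairs instead of bare lemmas
            (self.lemmatize(word, tag2pos(tag)), tag2pos(tag))
            for word, tag in pos_tag(word_tokenize(text))
        ]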

One use case I have in mind is lemmatizing the Brown Corpus, and comparing our output with the official Brown tags. Or what about the Gloss Corpus, @arademaker? I fear that our performance may not be very impressive, because of known issues with the lemmatize function, but it would be interesting to get some statistics.

@Sion1225
Author

Sion1225 commented May 24, 2024

@ekaf That's a great method! It's something I hadn't thought of. It's definitely more concise. I'm learning a lot.
I think returning the current output (lemma reflecting POS) and POS together as a tuple (lemma, POS) is also a very good idea. I think it's more useful too. I'll incorporate it into the code.

To merge it as a sub-method of WordNetLemmatizer, I think it will also be necessary to revise the usage instructions and example comments of WordNetLemmatizer.


@arademaker Our new function internally uses the lemmatize method of WordNetLemmatizer, so I think it would be okay to include it as a sub-method of WordNetLemmatizer. The current candidate names are lemmatize_text for the method and TextLemmatizer for the class, but if you have any better ideas, you are welcome to suggest them!

@ekaf
Contributor

ekaf commented May 25, 2024

I have started to measure the accuracy of this approach against the gold standard tags in the whole Brown Corpus, but I still need to tweak the tag2pos() function in order to accommodate some non-standard Brown Pos tags, like 'B' for the auxiliary "be", and 'H' for "have". Nevertheless, it is already clear that we'll end up with more than 90% accuracy for the Pos tagging.

So it is reasonable to base the lemmatization on the present pos tagging. However, this does not mean that we should prefer the WordNetLemmatizer's lemmatize() function. And I absolutely agree with @arademaker's objection against misusing the WordNet name.
@Sion1225's proposal could be extended to loop over any other lemmatize function, and there is also certainly room for developing new and better lemmatizers from scratch. But those goals are more ambitious, and better kept outside the scope of this PR.

@ekaf
Contributor

ekaf commented May 25, 2024

@Sion1225, this PR should be converted to draft (using the link in the upper right corner), as long as it is "Still in progress", and not ready for merging.

@Sion1225 Sion1225 marked this pull request as draft May 25, 2024 08:42
@Sion1225
Author

Sion1225 commented May 26, 2024

@ekaf
I pushed the code with the changes, and moved the examples from the WordNetLemmatizer class docstring to lemmatize, because we have another method now.

@Sion1225 Sion1225 requested a review from ekaf May 26, 2024 17:10
Contributor

@ekaf ekaf left a comment


@Sion1225, please note that your old doctests fail, now that the output is tuples that include the tags. This will become important when finalizing this PR, because doctests are required to succeed before merging. But you don't need to worry about it now, because you might also want to use the tuple2str function from nltk.tag to present the output in a prettier form in the doctest.

Would you consider calling the pos_tag function using "universal" as the value of the optional tagset parameter? The Universal Tagset is very convenient, because the nltk.tag module provides a map_tag function that can map many other tagsets to Universal Tags. This is handy for my current evaluation against the Brown tagset, and probably many other uses. So it would be more useful to return the Universal Tag instead of the WordNet pos.
If you want, I could push a commit, replacing tag2pos with this helper function:

def universal_tag_to_wn_pos(tag):
    """Convert Universal Tag to WordNet Pos.
    Return None when WordNet does not cover the Pos"""
    return {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}.get(tag, None)

@Sion1225
Author

Sion1225 commented May 27, 2024

@ekaf Yes, it definitely seems that the Universal Tagset is more useful. I would be happy if you could push the commit! Also, as you mentioned, I forgot to change the output example in the method comments. If the output changes again, I think that part will need to be revised as well. Would you please take care of that too?

@ekaf
Contributor

ekaf commented May 28, 2024

Ok, @Sion1225, I moved universal_tag_to_wn_pos() out of the class, because it can also be useful elsewhere. Maybe it should even go into the corpora/readers/wordnet.py module instead?
Using Universal Tags, the lemmatize_text() function achieves 92.25% Pos accuracy on the whole Brown Corpus, after discarding the sentences where the NLTK tokenization produces a different number of tokens than Brown's. These are not errors: the NLTK tokenizer tends to be more precise than Brown's, but those sentences are hard to compare when they don't align. This tokenization difference concerns 1.35% of the total number of tokens.
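Roughly, the comparison loop looks like this (a simplified sketch, not the exact evaluation script):

    from nltk.corpus import brown
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    wnl = WordNetLemmatizer()
    hits = total = 0
    for sent in brown.tagged_sents(tagset="universal"):
        words = [w for w, _ in sent]
        gold = [t for _, t in sent]
        # Discard sentences where the NLTK tokenization does not align
        # with Brown's tokens, since they cannot be compared one-to-one.
        if len(word_tokenize(" ".join(words))) != len(words):
            continue
        output = wnl.lemmatize_text(" ".join(words))
        hits += sum(tag == g for (_, tag), g in zip(output, gold))
        total += len(gold)
    print(f"Pos accuracy: {hits / total:.2%}")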

@Sion1225
Author

@ekaf I reviewed the code. Thank you for your modifications and push!
I don't have a deep understanding of the NLTK library's structure, but I think that as an independent function, it might fit better in a different module.
Is the tokenized Brown corpus provided by NLTK different from the result of tokenizing the Brown corpus using word_tokenize? It's an interesting analysis, and I'm curious about the remaining 7.75% difference.

@ekaf
Contributor

ekaf commented May 28, 2024

Here is a typical example of my evaluation of the Pos tagging, covering the first three sentences from the Brown Corpus: out-brown.txt.
In each line, the first tuple is the original Brown (token, tag), then the corresponding Universal Tag, and then the output from text_lemmatize().
Sentence (1) is discarded because the NLTK tokenizer splits Atlanta's into two tokens, so it doesn't align with Brown, where it is only one token.
Sentence (2) shows two Pos differences (for term-end and Executive).
Sentence (3) has one difference, for Superior.

These examples are typical, because the Pos differences are often between NOUN and ADJ.

@Sion1225
Author

Thank you, @ekaf, for the examples; they help me understand. I realised this is because some words can have multiple parts of speech, such as being both nouns and adjectives.
Regarding the usage example in the method's comments, I believe that it would be more intuitive to output the tuples directly rather than converting them to strings with the tuple2str function. This would make it easier to understand the function's behavior. What do you think?

@ekaf
Contributor

ekaf commented May 28, 2024

Yes @Sion1225, the problem is that the most frequent words are often ambiguous. The wn._morphy() function returns a list with all matches, but among those, lemmatize() picks the shortest word.
That's an arbitrary choice, which is not always the best. More linguistically ambitious applications consider all choices.
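For instance, with the standard WordNet data, something like this should illustrate the shortest-match behaviour:

    >>> from nltk.corpus import wordnet as wn
    >>> wn._morphy('axes', 'n')
    ['ax', 'axis']
    >>> from nltk.stem import WordNetLemmatizer
    >>> WordNetLemmatizer().lemmatize('axes', 'n')
    'ax'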

I don't mind if you prefer to present tuples in the docstring, just feel free to make any adjustment you wish.

@Sion1225
Author

I modified the comments, and I think it is now ready for review.

And @ekaf, I thought about universal_tag_to_wn_pos(); in my opinion, it fits well in wordnet.py.

@Sion1225 Sion1225 marked this pull request as ready for review May 29, 2024 16:17
@ekaf
Contributor

ekaf commented May 29, 2024

Thanks @Sion1225! Yes, it seems ready now.

@Sion1225 Sion1225 changed the title Develop auto lemmatize function Develop text lemmatize function May 30, 2024