Develop text lemmatize function #3257

Open
wants to merge 8 commits into
base: develop
Changes from 5 commits
1 change: 1 addition & 0 deletions AUTHORS.md
@@ -302,6 +302,7 @@
 - Akihiro Yamazaki <https://github.com/zakkie>
 - Ron Urbach <https://github.com/sharpblade4>
 - Vivek Kalyan <https://github.com/vivekkalyan>
+- Siwon Seo <https://github.com/Sion1225>

 ## Others whose work we've taken and included in NLTK, but who didn't directly contribute it:

62 changes: 48 additions & 14 deletions nltk/stem/wordnet.py
@@ -1,11 +1,13 @@
 # Natural Language Toolkit: WordNet stemmer interface
 #
-# Copyright (C) 2001-2023 NLTK Project
+# Copyright (C) 2001-2024 NLTK Project
 # Author: Steven Bird <stevenbird1@gmail.com>
 #         Edward Loper <edloper@gmail.com>
 # URL: <https://www.nltk.org/>
 # For license information, see LICENSE.TXT

+from typing import List
+
 from nltk.corpus import wordnet as wn


@@ -15,19 +17,6 @@ class WordNetLemmatizer:

     Lemmatize using WordNet's built-in morphy function.
     Returns the input word unchanged if it cannot be found in WordNet.
-
-    >>> from nltk.stem import WordNetLemmatizer
-    >>> wnl = WordNetLemmatizer()
-    >>> print(wnl.lemmatize('dogs'))
-    dog
-    >>> print(wnl.lemmatize('churches'))
-    church
-    >>> print(wnl.lemmatize('aardwolves'))
-    aardwolf
-    >>> print(wnl.lemmatize('abaci'))
-    abacus
-    >>> print(wnl.lemmatize('hardrock'))
-    hardrock
     """

     def lemmatize(self, word: str, pos: str = "n") -> str:
@@ -41,9 +30,54 @@ def lemmatize(self, word: str, pos: str = "n") -> str:
             for satellite adjectives.
         :type pos: str
         :return: The lemma of `word`, for the given `pos`.
+
+        >>> from nltk.stem import WordNetLemmatizer
+        >>> wnl = WordNetLemmatizer()
+        >>> print(wnl.lemmatize('dogs'))
+        dog
+        >>> print(wnl.lemmatize('churches'))
+        church
+        >>> print(wnl.lemmatize('aardwolves'))
+        aardwolf
+        >>> print(wnl.lemmatize('abaci'))
+        abacus
+        >>> print(wnl.lemmatize('hardrock'))
+        hardrock
         """
         lemmas = wn._morphy(word, pos)
         return min(lemmas, key=len) if lemmas else word

+    def lemmatize_text(self, text: str) -> List[str]:
+        """
+        Tokenize the input text, estimate the part-of-speech tag of each word,
+        and return a list of lemmas.
+
+        Returns each input word unchanged when it cannot be found in WordNet.
+
+        :param text: The input text to lemmatize.
+        :type text: str
+        :return: A list with the estimated lemma of each word in the input text.
+
+        >>> from nltk.stem import WordNetLemmatizer
+        >>> wntl = WordNetLemmatizer().lemmatize_text
+        >>> print(wntl('Proverbs are short sentences drawn from long experience.'))
+        ['Proverbs', 'be', 'short', 'sentence', 'draw', 'from', 'long', 'experience', '.']
+        >>> print(wntl('proverbs are short sentences drawn from long experience.'))
+        ['proverb', 'be', 'short', 'sentence', 'draw', 'from', 'long', 'experience', '.']
+        """
+        from nltk.tag import pos_tag
+        from nltk.tokenize import word_tokenize
+
+        return [
+            # Lemmatize each POS-tagged word, keeping only the lemma:
+            self.lemmatize(word, self.tag2pos(tag))
+            # Tokenize the input text and POS tag each word:
+            for word, tag in pos_tag(word_tokenize(text))
+        ]
+
+    @staticmethod
+    def tag2pos(tag: str) -> str:
+        return {"N": "n", "V": "v", "J": "a", "R": "r"}.get(tag[0], "n")
+
     def __repr__(self):
         return "<WordNetLemmatizer>"
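
For readers of the diff, the tag-to-POS step can be tried in isolation. The following is a minimal sketch, not part of the PR: it reuses the first-letter mapping from `tag2pos`, but `lemmatize_tagged`, `toy_lemmatize`, and the `_TOY_LEMMAS` table are hypothetical stand-ins so that the example runs without NLTK or its WordNet and tagger data. The use of `tag[:1]` so that an empty tag falls back to noun is also an assumption; the diff itself uses `tag[0]`.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Same first-letter mapping as tag2pos in the diff:
# Penn Treebank tag prefix -> WordNet POS letter.
_TAG_PREFIX_TO_POS: Dict[str, str] = {"N": "n", "V": "v", "J": "a", "R": "r"}


def tag2pos(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    # Assumption beyond the diff: slicing lets an empty tag fall back to "n".
    return _TAG_PREFIX_TO_POS.get(tag[:1], "n")


def lemmatize_tagged(
    tagged: Iterable[Tuple[str, str]],
    lemmatize: Callable[[str, str], str],
) -> List[str]:
    """Hypothetical helper: lemmatize already-tagged (word, tag) pairs."""
    return [lemmatize(word, tag2pos(tag)) for word, tag in tagged]


# Hypothetical toy lookup standing in for WordNetLemmatizer.lemmatize.
_TOY_LEMMAS: Dict[Tuple[str, str], str] = {
    ("are", "v"): "be",
    ("sentences", "n"): "sentence",
    ("drawn", "v"): "draw",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the toy lemma, or the word unchanged when it is unknown."""
    return _TOY_LEMMAS.get((word, pos), word)


tagged = [("sentences", "NNS"), ("are", "VBP"), ("drawn", "VBN"), ("long", "JJ"), (".", ".")]
print(lemmatize_tagged(tagged, toy_lemmatize))
# ['sentence', 'be', 'draw', 'long', '.']
```

Passing the lemmatizer in as a callable keeps the mapping testable without corpus downloads. Any tag outside the four prefixes, including punctuation tags, is looked up as a noun, so unknown tokens pass through unchanged.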