---
<strong>
    <h1 align='center'><strong>Lemmatization</strong></h1>
</strong>

---

**Stemming:**

- **Stemming** is a text normalization technique that aims to reduce words to their root or stem form by removing `suffixes` or `prefixes`.

- It operates through a set of heuristic rules and does not consider the meaning of words. As a result, stemming can sometimes produce root forms that are not actual words.

- For example:
  - `"jumping" → "jump"`
  - `"flies" → "fli"`
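
The rule-based nature of stemming can be sketched with a toy suffix stripper. This is not the Porter algorithm or any NLTK stemmer; the rule table `SUFFIX_RULES` and the function `toy_stem` are hypothetical names chosen for illustration, and the rules cover only a handful of endings.

```python
# Toy suffix-stripping stemmer (illustrative only, not a real stemming algorithm).
# Each rule is (suffix, replacement); the first matching rule wins.
SUFFIX_RULES = [
    ("ies", "i"),   # flies   -> fli
    ("ing", ""),    # jumping -> jump
    ("ed", ""),     # jumped  -> jump
    ("s", ""),      # cats    -> cat
]


def toy_stem(word: str) -> str:
    """Strip the first matching suffix without checking whether the result is a word."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters to remain so very short words are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["jumping", "flies", "cats", "is"]:
    print(w, "->", toy_stem(w))
```

Because the rules never consult a dictionary, `"flies"` comes out as the non-word `"fli"`. In practice a tested implementation such as NLTK's `PorterStemmer` would be used instead of hand-written rules.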

**Lemmatization:**

- **Lemmatization** is a text normalization technique that aims to reduce words to their base or root form by **considering the word's meaning and context**.

- It converts words to their `lemma` or `dictionary form`, which is a valid word.

- **Lemmatization** typically involves looking up a word in a `lexical resource` (like a dictionary) to find its base form.

- For example:
  - `"jumping" → "jump"`
  - `"flies" → "fly"`
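
Dictionary lookup, and the role of context, can be sketched with a tiny hand-made lexicon. The `TOY_LEXICON` table and the `toy_lemmatize` function are hypothetical stand-ins for a real lexical resource such as WordNet, and the part-of-speech labels are supplied by hand here rather than by a tagger.

```python
# Toy dictionary-based lemmatizer (illustrative only; the lexicon is made up for this sketch).
# Keys are (surface form, part of speech); values are dictionary forms.
TOY_LEXICON = {
    ("flies", "noun"): "fly",
    ("flies", "verb"): "fly",
    ("jumping", "verb"): "jump",
    ("meeting", "noun"): "meeting",  # "a long meeting"
    ("meeting", "verb"): "meet",     # "we are meeting today"
    ("better", "adjective"): "good",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for (word, pos); fall back to the word itself if unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("flies", "noun"))
print(toy_lemmatize("meeting", "noun"))
print(toy_lemmatize("meeting", "verb"))
print(toy_lemmatize("better", "adjective"))
```

The two `"meeting"` entries show why context matters: the same surface form maps to different lemmas depending on its part of speech, which a suffix-stripping stemmer cannot distinguish.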


In summary, both **stemming** and **lemmatization** are used for text normalization in NLP, but they work differently:

- **Lemmatization** is a more sophisticated technique that considers the meaning and context of words, resulting in more accurate base forms.

- **Stemming**, on the other hand, uses heuristic rules and may produce root forms that are not actual words.

The choice between `stemming` and `lemmatization` depends on the specific requirements of the NLP task and the level of linguistic accuracy needed.

Lemmatization is often preferred for tasks that require a deeper understanding of language or valid dictionary words in the output, such as machine translation, while stemming may be sufficient where speed matters more than linguistic accuracy, such as search engine indexing or simple text classification.

In [None]:
from pprint import pprint

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

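# Download the 'punkt' tokenizer models used by word_tokenize and sent_tokenize.
nltk.download('punkt', quiet=True)
# Recent NLTK releases load 'punkt_tab' instead; requesting it as well keeps the notebook working on both.
nltk.download('punkt_tab', quiet=True)

# Download the NLTK stopwords dataset, which contains common stopwords for various languages.
nltk.download('stopwords', quiet=True)

# Download the NLTK averaged perceptron tagger, which is used for part-of-speech tagging.
nltk.download('averaged_perceptron_tagger', quiet=True)
# Recent NLTK releases load the language-specific 'averaged_perceptron_tagger_eng' resource instead.
nltk.download('averaged_perceptron_tagger_eng', quiet=True)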

# Download the WordNet lexical database, which is used for various NLP tasks like synonym and antonym lookup.
nltk.download('wordnet', quiet=True)

# Download the NLTK names dataset, which contains lists of male and female first names.
# nltk.download('names', quiet=True)

# Download the NLTK movie_reviews dataset, which contains movie reviews categorized as positive and negative.
# nltk.download('movie_reviews', quiet=True)

# Download the NLTK reuters dataset, which is a collection of news documents categorized into topics.
# nltk.download('reuters', quiet=True)

# Download the NLTK brown corpus, which is a collection of text from various genres of written American English.
# nltk.download('brown', quiet=True)

True

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer

# Sample sentence
sentence = "The cats were chasing the mice, but they couldn't catch them."

# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Tokenize the sentence
words = nltk.word_tokenize(sentence)

# Lemmatize each word in the sentence
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

# Join the lemmatized words back into a sentence
lemmatized_sentence = " ".join(lemmatized_words)

# Print the original and lemmatized sentences
print("Original Sentence:", sentence)
print("Lemmatized Sentence:", lemmatized_sentence)

Original Sentence: The cats were chasing the mice, but they couldn't catch them.
Lemmatized Sentence: The cat were chasing the mouse , but they could n't catch them .


The `join` call used above concatenates the items of a list into one string, placing the separator between consecutive items:

```python
my_list = ["apple", "banana", "cherry"]
separator = ", "
result = separator.join(my_list)
print(result)  # Output: apple, banana, cherry
```

`"chasing"` is not changed to `"chase"` in the lemmatized sentence because `WordNetLemmatizer.lemmatize` **treats every word as a noun by default** (its `pos` argument defaults to `'n'`). Looked up as a noun, `"chasing"` has no shorter base form, so it is returned unchanged; the same happens to `"were"`.

**Lemmatization aims to reduce words to their base or dictionary form, and for verbs, the base form is typically the infinitive**, but the lemmatizer only finds it when it is told that the word is a verb.

If you want to lemmatize verbs into their base form, you can specify the part of speech (POS) tag when calling the lemmatize method.

In this case, you can tag the words in the sentence with their corresponding POS before lemmatization.

**Here's how you can modify your code to achieve this:**

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer

# Sample sentence
sentence = "The cats were chasing the mice, but they couldn't catch them."

# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Tokenize the sentence
words = nltk.word_tokenize(sentence)

# Tag the words with their POS (Part of Speech)
tagged_words = nltk.pos_tag(words)

# Define a function to map POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return 'a'  # Adjective
    elif treebank_tag.startswith('V'):
        return 'v'  # Verb
    elif treebank_tag.startswith('N'):
        return 'n'  # Noun
    elif treebank_tag.startswith('R'):
        return 'r'  # Adverb
    else:
        return 'n'  # Default to noun if POS tag is not found

# Lemmatize each word in the sentence based on its POS
lemmatized_words = [lemmatizer.lemmatize(word, pos=get_wordnet_pos(pos_tag)) for word, pos_tag in tagged_words]

# Join the lemmatized words back into a sentence
lemmatized_sentence = " ".join(lemmatized_words)

# Print the original and lemmatized sentences
print("Original Sentence:", sentence)
print("Lemmatized Sentence:", lemmatized_sentence)

Original Sentence: The cats were chasing the mice, but they couldn't catch them.
Lemmatized Sentence: The cat be chase the mouse , but they could n't catch them .


In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('stopwords', quiet=True)

# Sample text corpus
corpus = [
    "The quick brown foxes are jumping over the lazy dogs.",
    "Sheep are grazing peacefully in the meadow.",
    "I have been studying NLTK for natural language processing.",
    "Lemmatization helps reduce words to their base form."
]

# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Build the stopword set once; rebuilding the list for every token is slow
stop_words = set(stopwords.words('english'))

# Tokenize the corpus and perform lemmatization
lemmatized_corpus = []

for document in corpus:
    words = nltk.word_tokenize(document)
    # Remove stopwords (case-insensitively) and lemmatize the remaining tokens as nouns (the default)
    filtered_words = [lemmatizer.lemmatize(word) for word in words if word.lower() not in stop_words]
    lemmatized_corpus.append(" ".join(filtered_words))

# Print the lemmatized corpus
for i, document in enumerate(lemmatized_corpus):
    print(f"Document {i + 1}: {document}")


Document 1: quick brown fox jumping lazy dog .
Document 2: Sheep grazing peacefully meadow .
Document 3: studying NLTK natural language processing .
Document 4: Lemmatization help reduce word base form .


In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

paragraph = """Thank you to all of you in this room. I have to congratulate
               the other incredible nominees this year. The Revenant was
               the product of the tireless efforts of an unbelievable cast
               and crew. First off, to my brother in this endeavor, Mr. Tom
               Hardy. Tom, your talent on screen can only be surpassed by
               your friendship off screen … thank you for creating a t
               ranscendent cinematic experience. Thank you to everybody at
               Fox and New Regency … my entire team. I have to thank
               everyone from the very onset of my career … To my parents;
               none of this would be possible without you. And to my
               friends, I love you dearly; you know who you are. And lastly,
               I just want to say this: Making The Revenant was about
               man's relationship to the natural world. A world that we
               collectively felt in 2023 as the hottest year in recorded
               history. Our production needed to move to the southern
               tip of this planet just to be able to find snow. Climate
               change is real, it is happening right now. It is the most
               urgent threat facing our entire species, and we need to work
               collectively together and stop procrastinating. We need to
               support leaders around the world who do not speak for the
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and
               billions of underprivileged people out there who would be
               most affected by this. For our children’s children, and
               for those people out there whose voices have been drowned
               out by the politics of greed. I thank you all for this
               amazing award tonight. Let us not take this planet for
               granted. I do not take tonight for granted. Thank you so very much."""


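sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()

# Build the stopword set once instead of once per token
stop_words = set(stopwords.words('english'))

# Lemmatization (the stopword check is case-sensitive, so capitalized tokens such as "I" and "The" are kept)
for i, sentence in enumerate(sentences):
    words = nltk.word_tokenize(sentence)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)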

pprint(sentences)

['Thank room .',
 'I congratulate incredible nominee year .',
 'The Revenant product tireless effort unbelievable cast crew .',
 'First , brother endeavor , Mr. Tom Hardy .',
 'Tom , talent screen surpassed friendship screen … thank creating ranscendent '
 'cinematic experience .',
 'Thank everybody Fox New Regency … entire team .',
 'I thank everyone onset career … To parent ; none would possible without .',
 'And friend , I love dearly ; know .',
 "And lastly , I want say : Making The Revenant man 's relationship natural "
 'world .',
 'A world collectively felt 2023 hottest year recorded history .',
 'Our production needed move southern tip planet able find snow .',
 'Climate change real , happening right .',
 'It urgent threat facing entire specie , need work collectively together stop '
 'procrastinating .',
 'We need support leader around world speak big polluter , speak humanity , '
 'indigenous people world , billion billion underprivileged people would '
 'affected .',
 'For c