# TEXT PREPROCESSING - LEMMATIZATION

**Lemmatization** is a text preprocessing technique in which the tokens generated from the corpus are reduced to their dictionary form, also known as the lemma. Lemmas are meaningful words: the canonical forms of the words. Because this relies on a dictionary lookup, processing is more complex and slower than stemming. Lemmatization is preferred when the problem needs meaningful words as input (e.g. text generation) rather than just a symbolic representation (e.g. text classification, where stemming is usually enough).

Lemmatization uses Part of Speech (POS) tagging for better word-to-lemma conversion.
It is harder to build a lemmatizer for a new language than a stemmer.
The NLTK lemmatizer is based on the **WordNet database**, a lexical database of English that works somewhat like a thesaurus.
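
To see why the part of speech matters, here is a minimal sketch using a hand-written lookup table. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical names made up for this illustration; this is not how WordNet is implemented. In NLTK itself, `WordNetLemmatizer.lemmatize` accepts an optional `pos` argument and treats the word as a noun when it is omitted.

```python
# Toy illustration only: a hand-written table, NOT WordNet's internals.
# Keys are (surface form, part of speech); values are the lemma.
TOY_LEMMAS = {
    ("meeting", "n"): "meeting",  # "a long meeting": noun, already a lemma
    ("meeting", "v"): "meet",     # "we are meeting": verb form of "meet"
    ("leaves", "n"): "leaf",      # "autumn leaves"
    ("leaves", "v"): "leave",     # "she leaves early"
    ("better", "a"): "good",      # irregular comparative adjective
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


for word, pos in [("meeting", "n"), ("meeting", "v"),
                  ("leaves", "n"), ("leaves", "v")]:
    print(f"{word!r} tagged {pos!r} -> {toy_lemmatize(word, pos)!r}")
```

The same surface form maps to different lemmas depending on its tag, which is why a lemmatizer without POS information can only guess.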

**Stop word removal** is still necessary to drop words that carry little semantic weight.

In [3]:
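# NLTK is used to demonstrate lemmatization.
# One-time setup if the data is missing: nltk.download('punkt'), nltk.download('stopwords'),
# nltk.download('wordnet') (resource names may differ across NLTK versions).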
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

paragraph = """Paragraphs are the building blocks of papers. Many students define paragraphs \
in terms of length. A paragraph is a group of at least five sentences. Paragraph \
is half a page long, etc."""

In [4]:
# Check the list of stop words for the English language
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [5]:
#generate sentences from the paragraph
sentences = nltk.sent_tokenize(paragraph)
print(sentences)

['Paragraphs are the building blocks of papers.', 'Many students define paragraphs in terms of length.', 'A paragraph is a group of at least five sentences.', 'Paragraph is half a page long, etc.']


In [6]:
# Initialise the lemmatizer, then lemmatize each word and remove stop words
lemmatizer = WordNetLemmatizer()
lemmatized_sentences = []

# Build the stop-word set once instead of on every loop iteration
stop_words = set(stopwords.words('english'))

for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print("Words before lemmatization : ", words)

    lemmas = []
    for word in words:
        # The check is case-sensitive, so capitalised tokens such as 'A' are kept
        if word not in stop_words:
            # No POS tag is passed, so every word is lemmatized as a noun (the default)
            lemma = lemmatizer.lemmatize(word)
            lemmas.append(lemma)

    lemmatized_sentence = ' '.join(lemmas)
    lemmatized_sentences.append(lemmatized_sentence)

    print("Words after lemmatization : ", lemmas)

print("Sentences after lemmatization : ", lemmatized_sentences)

Words before lemmatization :  ['Paragraphs', 'are', 'the', 'building', 'blocks', 'of', 'papers', '.']
Words after lemmatization :  ['Paragraphs', 'building', 'block', 'paper', '.']
Words before lemmatization :  ['Many', 'students', 'define', 'paragraphs', 'in', 'terms', 'of', 'length', '.']
Words after lemmatization :  ['Many', 'student', 'define', 'paragraph', 'term', 'length', '.']
Words before lemmatization :  ['A', 'paragraph', 'is', 'a', 'group', 'of', 'at', 'least', 'five', 'sentences', '.']
Words after lemmatization :  ['A', 'paragraph', 'group', 'least', 'five', 'sentence', '.']
Words before lemmatization :  ['Paragraph', 'is', 'half', 'a', 'page', 'long', ',', 'etc', '.']
Words after lemmatization :  ['Paragraph', 'half', 'page', 'long', ',', 'etc', '.']
Sentences after lemmatization :  ['Paragraphs building block paper .', 'Many student define paragraph term length .', 'A paragraph group least five sentence .', 'Paragraph half page long , etc .']
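
In the output above, the capitalised tokens 'A' and 'Paragraphs' pass through unchanged, while lower-case 'a' is removed and 'paragraphs' becomes 'paragraph'. This suggests that both the stop-word check and the lemma lookup are sensitive to case. The sketch below shows one possible remedy for the stop-word part, lower-casing tokens before filtering. `STOP_WORDS` is a tiny hypothetical stand-in for `stopwords.words('english')`, and `remove_stopwords` is a helper name invented for this example.

```python
# Hypothetical mini stop-word list, standing in for stopwords.words('english').
STOP_WORDS = {"a", "is", "of", "at", "the", "are", "in"}


def remove_stopwords(tokens, lowercase=False):
    """Drop stop words; optionally lower-case each token before the check."""
    kept = []
    for token in tokens:
        candidate = token.lower() if lowercase else token
        if candidate not in STOP_WORDS:
            kept.append(candidate)
    return kept


tokens = ["A", "paragraph", "is", "a", "group", "of",
          "at", "least", "five", "sentences", "."]

# Case-sensitive check: 'A' is not in the lower-case list, so it survives.
print(remove_stopwords(tokens))

# Lower-casing first: 'A' becomes 'a' and is removed like the other stop words.
print(remove_stopwords(tokens, lowercase=True))
```

Whether lower-casing is appropriate depends on the task, since it also discards capitalisation that may mark proper nouns.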


**Notes :**

* Lemmatization solves the main issue of stemming (stems that are not meaningful words), but we pay for it in complexity and processing time.
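
The first half of this note can be illustrated with a rough sketch that contrasts crude suffix stripping with a dictionary lookup. Both `toy_stem` and `demo_lemmatize` are simplified, hypothetical helpers written for this example; they are not the Porter stemmer or the WordNet lemmatizer, and the sketch says nothing about their relative speed.

```python
def toy_stem(word):
    """Crude suffix stripping: chops a common ending without checking the result."""
    for suffix in ("ies", "es", "s", "ing"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical mini dictionary standing in for a real lexical database.
DEMO_LEMMAS = {"studies": "study", "papers": "paper"}


def demo_lemmatize(word):
    """Look the word up; return it unchanged when it is not in the dictionary."""
    return DEMO_LEMMAS.get(word, word)


for word in ["studies", "papers", "building"]:
    print(f"{word}: stem={toy_stem(word)!r}, lemma={demo_lemmatize(word)!r}")
```

Here the stemmer turns 'studies' into a fragment and cuts the noun 'building' down to 'build', while the lookup returns dictionary words or leaves the token alone.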