## Stemming

    - Stemming is the process of removing part of a word to reduce it to its stem (root).
    - A stemming algorithm does this by stripping affixes, usually suffixes, according to fixed rules.
    
    For example: "chocolates", "chocolaty", "choco"
      
    When stemming is applied, these words are reduced to their common stem (in the example above, "choco" is
    the stem).
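
To make the idea of suffix stripping concrete, here is a minimal sketch in plain Python. The suffix list and the `naive_stem` function are assumptions made up for illustration; they are not how any NLTK stemmer is implemented, since real stemmers apply ordered rule sets with extra conditions.

```python
# Hypothetical suffix list, checked in order; the first match wins.
SUFFIXES = ("ing", "ers", "er", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no suffix matched, so the word is its own stem

for w in ["program", "programs", "programmer", "programmers", "programming"]:
    print(w, "->", naive_stem(w))
# program -> program
# programs -> program
# programmer -> programm
# programmers -> programm
# programming -> programm
```

Even this toy version shows why stems are not always dictionary words: "programm" is a stem, not a word.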


### Over Stemming:

    Over-stemming occurs when more of a word is chopped off than necessary, so that two or more words with
    different meanings are incorrectly reduced to the same stem when they should have been reduced to
    different stems. This is also known as a false positive.

    For Example: "universal", "university", "universe" ==> "univers"


### Under Stemming:

    Under-stemming occurs when two or more words that should be reduced to the same stem are reduced to
    different stems. This is also known as a false negative.
    
    For example: "alumnus", "alumni", "alumnae" are typically left with different stems.
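
One way to make over- and under-stemming measurable is to count word pairs. The sketch below assumes you already have groups of words that should share a stem. The `stemming_errors` function, the grouping, and the `toy_stems` table are all hypothetical stand-ins chosen to mirror the examples above, not output taken from a real stemmer.

```python
from itertools import combinations

def stemming_errors(groups, stem):
    """Count (over, under) stemming errors as word pairs.

    groups: list of lists; words in one inner list should share a stem.
    stem:   any callable that maps a word to its stem.
    """
    labelled = [(word, gid) for gid, group in enumerate(groups) for word in group]
    over = under = 0
    for (w1, g1), (w2, g2) in combinations(labelled, 2):
        same_stem = stem(w1) == stem(w2)
        if g1 != g2 and same_stem:
            over += 1       # unrelated words merged: false positive
        elif g1 == g2 and not same_stem:
            under += 1      # related words kept apart: false negative
    return over, under

# Hypothetical stem table standing in for a stemmer.
toy_stems = {
    "universal": "univers", "universe": "univers", "university": "univers",
    "alumnus": "alumnu", "alumni": "alumni", "alumnae": "alumna",
}
# Assumed grouping: "university" is treated as a separate concept.
groups = [["universal", "universe"], ["university"], ["alumnus", "alumni", "alumnae"]]

print(stemming_errors(groups, toy_stems.get))  # (2, 3)
```

Any stemmer's `stem` method could be passed in place of `toy_stems.get` to compare stemmers on the same word groups.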

### Porter Stemmer Algorithm

In [1]:
from nltk.stem import PorterStemmer

In [2]:
from nltk.tokenize import word_tokenize

In [3]:
ps = PorterStemmer()

In [4]:
word = ["program", "programmer", "programs", "programming", "programmers"]

In [5]:
for item in word:
    print(item," ", ps.stem(item))

program   program
programmer   programm
programs   program
programming   program
programmers   programm


In [7]:
sentence = "Provision Maximum multiple owned caring on go gone going was this"

In [8]:
tk_list = word_tokenize(sentence)

In [9]:
tk_list

['Provision',
 'Maximum',
 'multiple',
 'owned',
 'caring',
 'on',
 'go',
 'gone',
 'going',
 'was',
 'this']

In [10]:
for word in tk_list:
    print(word, " ", ps.stem(word))

Provision   provis
Maximum   maximum
multiple   multipl
owned   own
caring   care
on   on
go   go
gone   gone
going   go
was   wa
this   thi


In [11]:
words = ['generous','generate','generously','generation']
for word in words:
    print(word," ",ps.stem(word))

generous   gener
generate   gener
generously   gener
generation   gener


### Snowball Stemmer Algorithm

In [12]:
from nltk.stem import SnowballStemmer

In [14]:
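# Note: the German stemmer is selected here, but the inputs below are English,
# so several words come back unstemmed. For English text use language="english".
snowball = SnowballStemmer(language="german")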

In [23]:
word = ["program", "programmer", "programs", "programming", "programmers"]

In [24]:
for item in word:
    print(item," ", snowball.stem(item))

program   program
programmer   programm
programs   program
programming   programming
programmers   programm


In [26]:
for item in tk_list:
    print(item," ", snowball.stem(item))

Provision   provision
Maximum   maximum
multiple   multipl
owned   owned
caring   caring
on   on
go   go
gone   gon
going   going
was   was
this   this


In [27]:
for item in words:
    print(item," ", snowball.stem(item))

generous   generous
generate   generat
generously   generously
generation   generation


### Lancaster Stemmer Algorithm

    The Lancaster stemmer is simple and aggressive, so it tends to over-stem. Over-stemming can produce stems that are
    not meaningful words.

In [28]:
from nltk.stem import LancasterStemmer

In [29]:
lancaster = LancasterStemmer()

In [30]:
for item in word:
    print(item," ", lancaster.stem(item))

program   program
programmer   program
programs   program
programming   program
programmers   program


In [31]:
for item in tk_list:
    print(item," ", lancaster.stem(item))

Provision   provid
Maximum   maxim
multiple   multipl
owned   own
caring   car
on   on
go   go
gone   gon
going   going
was   was
this   thi


In [32]:
for item in words:
    print(item," ", lancaster.stem(item))

generous   gen
generate   gen
generously   gen
generation   gen


## Lemmatization

    - Lemmatization is the process of converting a word to its base (dictionary) form, called the lemma.
    - The difference between stemming and lemmatization is that lemmatization considers the context and converts the
      word into a meaningful base form, whereas stemming just removes the last few characters, often producing
      non-words or changing the meaning.
    
    For example:    'Caring' => Lemmatization => 'Care'
                    'Caring' => Stemming => 'Car'
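
The contrast can be sketched with two toy functions. Both are assumptions for illustration only: `toy_lemmatize` looks words up in a tiny hand-written lexicon keyed by part of speech, while `toy_stem` blindly chops a suffix. A real lemmatizer relies on a full vocabulary such as WordNet.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> lemma.
LEXICON = {
    ("caring", "v"): "care",
    ("feet", "n"): "foot",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma if the (word, pos) pair is known, else the lower-cased word."""
    word = word.lower()
    return LEXICON.get((word, pos), word)

def toy_stem(word):
    """Chop a trailing 'ing' without checking whether the result is a word."""
    word = word.lower()
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word

print(toy_lemmatize("Caring", pos="v"))  # care
print(toy_stem("Caring"))                # car
print(toy_lemmatize("Caring"))           # caring (no noun entry, so unchanged)
```

The last call shows why the part of speech matters: the same word has a different lemma, or none, depending on how it is used.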

**Some lemmatization libraries**

    - WordNet Lemmatizer (NLTK),
    - spaCy lemmatizer,
    - TextBlob,
    - CLiPS Pattern,
    - Stanford CoreNLP,
    - Gensim Lemmatizer,
    - TreeTagger

In [33]:
import nltk

In [35]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\Users\Abhi
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [36]:
from nltk.stem import WordNetLemmatizer

In [40]:
lemmatizer = WordNetLemmatizer()

In [41]:
words = ['bats', 'are', 'feet', 'hands']

In [42]:
# lemmatization (the default part of speech is noun, so 'are' is left unchanged)

for item in words:
    print(item," ",lemmatizer.lemmatize(item))

bats   bat
are   are
feet   foot
hands   hand


In [43]:
# lancaster stemming

for item in words:
    print(item," ",lancaster.stem(item))

bats   bat
are   ar
feet   feet
hands   hand


In [44]:
# snowball stemming

for item in words:
    print(item," ",snowball.stem(item))

bats   bat
are   are
feet   feet
hands   hand


In [45]:
# porter stemming

for item in words:
    print(item," ",ps.stem(item))

bats   bat
are   are
feet   feet
hands   hand


### Sentence Lemmatization

In [46]:
sentence = "The striped bats are hanging on their feet for best"

In [47]:
word_list = nltk.word_tokenize(sentence)

In [48]:
word_list

['The',
 'striped',
 'bats',
 'are',
 'hanging',
 'on',
 'their',
 'feet',
 'for',
 'best']

In [51]:
lemmatized_output = ' '.join([lemmatizer.lemmatize(item) for item in word_list])

In [52]:
print(lemmatized_output)

The striped bat are hanging on their foot for best


In [53]:
print(sentence)

The striped bats are hanging on their feet for best


In [54]:
# lemmatize 'stripes' treating it as a verb (pos='v')

print(lemmatizer.lemmatize('stripes','v'))

strip


In [55]:
# lemmatize 'stripes' treating it as a noun (pos='n')

print(lemmatizer.lemmatize('stripes','n'))

stripe
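
Because the result depends on the part of speech, a common approach is to tag each token first and pass a matching `pos` value to the lemmatizer. The helper below is a sketch under assumptions: `penn_to_wordnet` is a hypothetical name, and the tagged pairs are written by hand to stand in for a tagger's output. The letters 'a', 'v', 'n' and 'r' are the WordNet codes for adjective, verb, noun and adverb.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter.

    Unknown tags fall back to 'n', mirroring the lemmatizer's noun default.
    """
    first = tag[:1].upper()
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(first, "n")

# Hand-written (word, tag) pairs standing in for a POS tagger's output.
tagged = [("bats", "NNS"), ("are", "VBP"), ("hanging", "VBG"), ("feet", "NNS")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
# [('bats', 'n'), ('are', 'v'), ('hanging', 'v'), ('feet', 'n')]
```

Each pair could then be passed on as `lemmatizer.lemmatize(word, pos)`. Tags in this format are what `nltk.pos_tag` returns, assuming its tagger resource has been downloaded.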


In [56]:
print("can't",lemmatizer.lemmatize("can't"))
print("what's",lemmatizer.lemmatize("what's"))
print("couldn't",lemmatizer.lemmatize("couldn't"))
print("wasn't",lemmatizer.lemmatize("wasn't"))

can't can't
what's what's
couldn't couldn't
wasn't wasn't
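
The lemmatizer leaves contractions untouched, so one option is to expand them before lemmatizing. The sketch below assumes a small hand-written table; `CONTRACTIONS` and `expand_contractions` are hypothetical names, and a practical table would be much larger. Some expansions are ambiguous ("what's" can also mean "what has"), so treat this as a rough preprocessing step.

```python
# Hypothetical, deliberately tiny expansion table.
CONTRACTIONS = {
    "can't": "can not",
    "what's": "what is",
    "couldn't": "could not",
    "wasn't": "was not",
}

def expand_contractions(text):
    """Replace known contractions token by token; other tokens pass through."""
    return " ".join(CONTRACTIONS.get(tok.lower(), tok) for tok in text.split())

print(expand_contractions("It wasn't what's expected"))
# It was not what is expected
```

The expanded text can then be tokenized and lemmatized as in the sentence example above. Note that a matched contraction is replaced by its lower-case expansion, so capitalization at the start of a sentence is lost.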
