
# Stemming
*A stemming algorithm, a procedure to reduce all words with the same stem to a common form, is useful in many areas of computational linguistics and information-retrieval work.* ~ Lovins, 1968

Examples of stemmers include:

1. PorterStemmer
2. SnowballStemmer
3. LancasterStemmer
4. RegexpStemmer



In [1]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## PorterStemmer
Designed and published by Martin Porter in 1980.

It strips suffixes in five sequential steps, each with its own set of mapping rules. It is simple and fast.


In [2]:

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer, WordNetLemmatizer
words = ["friend", "friendship", "friends", "friendships","generate","generates","generating","general","generally","generic","generically","generous","generously","went","ate"]
Porter = PorterStemmer()

for word in words:
    print(word,"--->",Porter.stem(word))

friend ---> friend
friendship ---> friendship
friends ---> friend
friendships ---> friendship
generate ---> gener
generates ---> gener
generating ---> gener
general ---> gener
generally ---> gener
generic ---> gener
generically ---> gener
generous ---> gener
generously ---> gener
went ---> went
ate ---> ate


In [3]:
new_words = ["walk","walking","walked","walks"]
for word in new_words:
    print(word,"--->",Porter.stem(word))

walk ---> walk
walking ---> walk
walked ---> walk
walks ---> walk
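To make the conflation visible, the words can be grouped by the stem they share. This is only a small sketch: the `groups` dictionary and the shortened word list are illustrative names, not part of the notebook above.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

porter = PorterStemmer()
sample_words = ["generate", "generates", "general", "generic", "generous",
                "friend", "friends"]

# Collect every word under the stem that Porter assigns to it.
groups = defaultdict(list)
for word in sample_words:
    groups[porter.stem(word)].append(word)

for stem, members in groups.items():
    print(stem, "<-", members)
```

Words with unrelated meanings ("general", "generous", "generate") end up in the same group, which is the over-stemming the later stemmers are compared against.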


## SnowballStemmer/Porter2Stemmer
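Also designed by Martin Porter, as a revision of the original Porter algorithm (hence the name Porter2).

It is usually described as slightly faster and more precise than the original Porter stemmer. In the output below it keeps "general", "generic", and "generous" apart instead of collapsing them all to "gener".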

In [4]:
snowball = SnowballStemmer(language='english')

for word in words:
    print(word,"--->",snowball.stem(word))

friend ---> friend
friendship ---> friendship
friends ---> friend
friendships ---> friendship
generate ---> generat
generates ---> generat
generating ---> generat
general ---> general
generally ---> general
generic ---> generic
generically ---> generic
generous ---> generous
generously ---> generous
went ---> went
ate ---> ate


## LancasterStemmer
A simpler but more aggressive stemmer.

It tends to over-stem, which can leave stems that are no longer meaningful words (for example, "generous" becomes "gen" below).

In [5]:
lancaster = LancasterStemmer()

for word in words:
    print(word,"--->",lancaster.stem(word))

friend ---> friend
friendship ---> friend
friends ---> friend
friendships ---> friend
generate ---> gen
generates ---> gen
generating ---> gen
general ---> gen
generally ---> gen
generic ---> gen
generically ---> gen
generous ---> gen
generously ---> gen
went ---> went
ate ---> at
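A side-by-side view can make the differences between the three rule-based stemmers easier to read. The sketch below assumes nothing beyond the NLTK classes already imported; the `stemmers` dictionary, the `compare` helper, and the short `sample` list are illustrative names.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}
sample = ["friendships", "generously", "generic", "ate"]


def compare(words):
    """Return {word: {stemmer_name: stem}} for every word."""
    return {w: {name: s.stem(w) for name, s in stemmers.items()} for w in words}


table = compare(sample)

# Print one row per word and one column per stemmer.
print(f"{'word':<14}" + "".join(f"{name:<12}" for name in stemmers))
for word, stems in table.items():
    print(f"{word:<14}" + "".join(f"{stems[name]:<12}" for name in stemmers))
```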


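## RegexpStemmer
Strips suffixes using a regular expression that you supply.

Any substring that matches the pattern is removed from the word. Words shorter than `min` characters are left untouched.

It is the worst performer of the four here, because the pattern is applied blindly: "ate" is reduced to an empty string.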

In [6]:
regex = RegexpStemmer('ing$|s$|e$|able$|lly$|ate$', min=3)

for word in words:
    print(word,"--->",regex.stem(word))

friend ---> friend
friendship ---> friendship
friends ---> friend
friendships ---> friendship
generate ---> gener
generates ---> generate
generating ---> generat
general ---> general
generally ---> genera
generic ---> generic
generically ---> generica
generous ---> generou
generously ---> generously
went ---> went
ate ---> 
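The behaviour seen here can be approximated with the standard `re` module, which helps explain why "ate" disappears. This is a rough sketch rather than NLTK's actual implementation; `regex_stem` and `regex_stem_guarded` are hypothetical helper names.

```python
import re

PATTERN = r"ing$|s$|e$|able$|lly$|ate$"


def regex_stem(word, pattern=PATTERN, min_length=3):
    """Remove whatever the pattern matches, unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


def regex_stem_guarded(word, min_stem=3):
    """Hypothetical variant: keep the original word if the stem would get too short."""
    stem = regex_stem(word)
    return stem if len(stem) >= min_stem else word


for word in ["generate", "generates", "generally", "ate"]:
    print(word, "--->", repr(regex_stem(word)), "| guarded:", repr(regex_stem_guarded(word)))
```

The guard on the length of the remaining stem is one possible way to avoid empty results; whether it is appropriate depends on the vocabulary.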


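# Lemmatization
WordNetLemmatizer treats every word as a noun unless a part-of-speech argument is passed, so verbs such as "are" and "hanging" come back unchanged in the first example.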

In [7]:
wordnet = WordNetLemmatizer()
lemm_word = ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
for word in lemm_word:
    print(word,"--->",wordnet.lemmatize(word))

The ---> The
striped ---> striped
bats ---> bat
are ---> are
hanging ---> hanging
on ---> on
their ---> their
feet ---> foot
for ---> for
best ---> best


In [8]:
from gensim.utils import tokenize

In [9]:
from nltk import  pos_tag
text = '''President Paul Kagame has said that deliberate efforts are needed to forge private-public partnerships to bridge internet usage gaps. He was speaking during the inauguration of the Mobile World Congress 2022 which convened more than 2000 people representing 99 countries, on October 25.
Global mobile operators, device manufacturers, technology providers, vendors, content owners, and policymakers are in Kigali to identify gaps and discuss effective measures needed to drive digital transformation in Africa. To address the usage gap –the number of people who can’t use mobile internet services while living in an area covered by broadband networks –Kagame said that neither the private nor the public sector has all that is required to cover the gap, hence, the need for partnerships. '''

tokens = list(tokenize(text))
tokens

['President',
 'Paul',
 'Kagame',
 'has',
 'said',
 'that',
 'deliberate',
 'efforts',
 'are',
 'needed',
 'to',
 'forge',
 'private',
 'public',
 'partnerships',
 'to',
 'bridge',
 'internet',
 'usage',
 'gaps',
 'He',
 'was',
 'speaking',
 'during',
 'the',
 'inauguration',
 'of',
 'the',
 'Mobile',
 'World',
 'Congress',
 'which',
 'convened',
 'more',
 'than',
 'people',
 'representing',
 'countries',
 'on',
 'October',
 'Global',
 'mobile',
 'operators',
 'device',
 'manufacturers',
 'technology',
 'providers',
 'vendors',
 'content',
 'owners',
 'and',
 'policymakers',
 'are',
 'in',
 'Kigali',
 'to',
 'identify',
 'gaps',
 'and',
 'discuss',
 'effective',
 'measures',
 'needed',
 'to',
 'drive',
 'digital',
 'transformation',
 'in',
 'Africa',
 'To',
 'address',
 'the',
 'usage',
 'gap',
 'the',
 'number',
 'of',
 'people',
 'who',
 'can',
 't',
 'use',
 'mobile',
 'internet',
 'services',
 'while',
 'living',
 'in',
 'an',
 'area',
 'covered',
 'by',
 'broadband',
 'networks',


In [13]:
pos_list = pos_tag(tokens)
pos_list

[('President', 'NNP'),
 ('Paul', 'NNP'),
 ('Kagame', 'NNP'),
 ('has', 'VBZ'),
 ('said', 'VBD'),
 ('that', 'IN'),
 ('deliberate', 'NN'),
 ('efforts', 'NNS'),
 ('are', 'VBP'),
 ('needed', 'VBN'),
 ('to', 'TO'),
 ('forge', 'VB'),
 ('private', 'JJ'),
 ('public', 'JJ'),
 ('partnerships', 'NNS'),
 ('to', 'TO'),
 ('bridge', 'VB'),
 ('internet', 'JJ'),
 ('usage', 'NN'),
 ('gaps', 'NNS'),
 ('He', 'PRP'),
 ('was', 'VBD'),
 ('speaking', 'VBG'),
 ('during', 'IN'),
 ('the', 'DT'),
 ('inauguration', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('Mobile', 'NNP'),
 ('World', 'NNP'),
 ('Congress', 'NNP'),
 ('which', 'WDT'),
 ('convened', 'VBD'),
 ('more', 'RBR'),
 ('than', 'IN'),
 ('people', 'NNS'),
 ('representing', 'VBG'),
 ('countries', 'NNS'),
 ('on', 'IN'),
 ('October', 'NNP'),
 ('Global', 'NNP'),
 ('mobile', 'NN'),
 ('operators', 'NNS'),
 ('device', 'VBP'),
 ('manufacturers', 'NNS'),
 ('technology', 'NN'),
 ('providers', 'NNS'),
 ('vendors', 'NNS'),
 ('content', 'NN'),
 ('owners', 'NNS'),
 ('and', 'CC'

In [16]:
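# Only two Penn Treebank tags are mapped here; every other tag falls back to noun ('n')
matched_tags = {'NNP':'n',"VBP":'v'}
processed_tag = []
for token, tag in pos_list:
  # .get() with a default avoids a KeyError for tags missing from matched_tags (e.g. 'VBZ')
  token = wordnet.lemmatize(token, matched_tags.get(tag, 'n'))
  processed_tag.append(token)
  #print(token,'-------------',tag)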
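A lookup keyed on full Penn Treebank tags needs an entry for every tag the tagger can emit. One way around that, sketched here with a hypothetical helper `penn_to_wordnet`, is to map only the first letter of the tag to the WordNet codes 'n', 'v', 'a', and 'r', and to fall back to noun for everything else. The fallback choice is an assumption, not something the notebook prescribes.

```python
def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS code (hypothetical helper)."""
    prefix_to_pos = {"N": "n", "V": "v", "J": "a", "R": "r"}
    # tag[:1] is the coarse class: N(oun), V(erb), J (adjective), R (adverb).
    return prefix_to_pos.get(tag[:1], default)


# A few (token, tag) pairs taken from the tagger output shown earlier.
tagged = [("efforts", "NNS"), ("are", "VBP"), ("needed", "VBN"),
          ("private", "JJ"), ("more", "RBR"), ("the", "DT")]

wordnet_tags = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(wordnet_tags)
```

Each resulting pair could then be passed on as `wordnet.lemmatize(token, pos)`.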

In [17]:
#print(wordnet.lemmatize('countries'))
#pos_tag(['best'])
print(wordnet.lemmatize('better','a'))

good


# Stopwords
Stop words are common, simple words that add little value on their own.

The goal is to keep the resulting feature matrix as small as possible, so it makes sense to remove common words that do not add value. Examples are "I", "a", and "an".


In [18]:
from nltk.corpus import stopwords
sw = stopwords.words('english')
print(sw)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [19]:
len(tokens)

tokens_no_sw = []
for token in tokens:
  if token not in sw:
    tokens_no_sw.append(token)

print(len(tokens_no_sw))

75
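The NLTK stop word list is all lowercase, so a case-sensitive membership test keeps capitalised tokens such as "He" or "To". The sketch below shows a case-insensitive filter. The `sample_stop_words` set is an illustrative subset (in the notebook the real list would be `sw`), and `remove_stop_words` is a hypothetical helper name.

```python
# Illustrative subset only; in the notebook this would be stopwords.words('english').
sample_stop_words = {"the", "to", "he", "was", "of", "in"}


def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercased form is a stop word."""
    stop_set = {w.lower() for w in stop_words}  # set lookup is O(1) per token
    return [t for t in tokens if t.lower() not in stop_set]


sample_tokens = ["He", "was", "speaking", "in", "Kigali", "To", "address", "the", "gap"]

case_sensitive = [t for t in sample_tokens if t not in sample_stop_words]
case_insensitive = remove_stop_words(sample_tokens, sample_stop_words)

print(case_sensitive)    # capitalised stop words survive
print(case_insensitive)  # they are removed once tokens are lowercased for the test
```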


# In-class practicals
1. How many stop words are in NLTK, spaCy, and Gensim? Compare them and select one.
2. Lemmatize the above text using a for loop.
3. Compare the stemmers, pick the best, and compare it with the lemmatizer.
4. Remove stop words from the text.

# Assignment
Create a function that takes the tokens, normalizes them, and removes the stop words.

In [None]:

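def stemming_lem_sw(tokens):
  """Drop stop words, then stem the remaining tokens with the Snowball stemmer."""
  new_tokens = []
  for token in tokens:
    # Check the lowercased token before stemming: a stemmed form may no longer match an entry in sw
    if token.lower() not in sw:
      new_tokens.append(snowball.stem(token))

  return new_tokens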

