### Tokenization
1. Converting Paragraph -> Sentences

In [6]:
corpus = """Hello World, Welcome to Nivee's AI. 
Let's learn AI and grow together! and become an expert in AI"""

In [7]:
from nltk.tokenize import sent_tokenize

In [8]:
documents = sent_tokenize(corpus)

In [None]:
documents

["Hello World, Welcome to Nivee's AI.",
 "Let's learn AI and grow together!",
 'and become an expert in AI']

In [None]:
for d in documents:
    print(d)

Hello World, Welcome to Nivee's AI.
Let's learn AI and grow together!
and become an expert in AI


In [None]:
len(documents)

3
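To see why a trained sentence tokenizer is useful, here is a minimal sketch of a naive splitter that uses only the standard library. The helper name `naive_sent_split` is hypothetical, and this is not how `sent_tokenize` works internally; it simply breaks after `.`, `!` or `?` when whitespace follows.

```python
import re

corpus = ("Hello World, Welcome to Nivee's AI. \n"
          "Let's learn AI and grow together! and become an expert in AI")

def naive_sent_split(text):
    # Hypothetical helper: split wherever '.', '!' or '?' is followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sent_split(corpus))
print(naive_sent_split("Dr. Smith arrived."))
```

On this corpus the naive rule happens to produce the same three sentences, but it wrongly breaks after the abbreviation in the second input, which is the kind of case a trained tokenizer is meant to handle.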

2. Paragraph -> Words / Sentence -> Words

In [9]:
from nltk.tokenize import word_tokenize

In [10]:
word_tokenize(corpus)

['Hello',
 'World',
 ',',
 'Welcome',
 'to',
 'Nivee',
 "'s",
 'AI',
 '.',
 'Let',
 "'s",
 'learn',
 'AI',
 'and',
 'grow',
 'together',
 '!',
 'and',
 'become',
 'an',
 'expert',
 'in',
 'AI']

Tokenizing each sentence into words, one sentence per line

In [11]:
for doc in documents:
    print(word_tokenize(doc))

['Hello', 'World', ',', 'Welcome', 'to', 'Nivee', "'s", 'AI', '.']
['Let', "'s", 'learn', 'AI', 'and', 'grow', 'together', '!']
['and', 'become', 'an', 'expert', 'in', 'AI']


Separating punctuation: `wordpunct_tokenize` splits every punctuation mark, including apostrophes, into its own token

In [12]:
from nltk.tokenize import wordpunct_tokenize

In [13]:
for doc in documents:
    print(wordpunct_tokenize(doc))

['Hello', 'World', ',', 'Welcome', 'to', 'Nivee', "'", 's', 'AI', '.']
['Let', "'", 's', 'learn', 'AI', 'and', 'grow', 'together', '!']
['and', 'become', 'an', 'expert', 'in', 'AI']
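NLTK documents `wordpunct_tokenize` as a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. Assuming that pattern, the behaviour above can be reproduced with `re.findall`. This is a sketch for illustration, not a replacement for the NLTK function.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

sentence = "Let's learn AI and grow together!"
tokens = WORDPUNCT.findall(sentence)
print(tokens)
```

Because the apostrophe is neither a word character nor whitespace, it forms its own token, which is why `Let's` comes out as three pieces here.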


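The Treebank tokenizer does not split a full stop into a separate token unless it is the last full stop at the end of the text, so a full stop in the middle of the text stays attached to its word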

In [14]:
from nltk.tokenize import TreebankWordTokenizer

In [15]:
documents

["Hello World, Welcome to Nivee's AI.",
 "Let's learn AI and grow together!",
 'and become an expert in AI']

In [16]:
tokens = TreebankWordTokenizer()
tokens.tokenize(corpus)

['Hello',
 'World',
 ',',
 'Welcome',
 'to',
 'Nivee',
 "'s",
 'AI.',
 'Let',
 "'s",
 'learn',
 'AI',
 'and',
 'grow',
 'together',
 '!',
 'and',
 'become',
 'an',
 'expert',
 'in',
 'AI']
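One way to get each sentence-final full stop split off is to run the Treebank tokenizer on one sentence at a time. The sketch below hard-codes the sentence list from earlier so it runs on its own; the variable names `whole` and `per_sentence` are illustrative.

```python
from nltk.tokenize import TreebankWordTokenizer

treebank = TreebankWordTokenizer()

documents = ["Hello World, Welcome to Nivee's AI.",
             "Let's learn AI and grow together!",
             "and become an expert in AI"]
corpus = ("Hello World, Welcome to Nivee's AI. \n"
          "Let's learn AI and grow together! and become an expert in AI")

# Whole text at once: the full stop after 'AI' is mid-text, so it stays attached.
whole = treebank.tokenize(corpus)

# One sentence at a time: each sentence ends the input, so its full stop is split off.
per_sentence = [treebank.tokenize(doc) for doc in documents]

print(whole)
for tokens in per_sentence:
    print(tokens)
```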

### Stemming

Stemming is the process of reducing a word to its word stem.

In [20]:
words = ['eating', 'eats', 'eaten', 'writing', 'writes', 'programming', 'programs', 'history', 'finally', 'finalized']

1. Porter Stemmer technique

In [17]:
from nltk.stem import PorterStemmer

In [18]:
stemming = PorterStemmer()

In [21]:
for word in words:
    print(word+'->>>>>>>>>>'+stemming.stem(word))

eating->>>>>>>>>>eat
eats->>>>>>>>>>eat
eaten->>>>>>>>>>eaten
writing->>>>>>>>>>write
writes->>>>>>>>>>write
programming->>>>>>>>>>program
programs->>>>>>>>>>program
history->>>>>>>>>>histori
finally->>>>>>>>>>final
finalized->>>>>>>>>>final


Disadvantage of stemming: the stem is not always a meaningful word, e.g. 'history' becomes 'histori'

In [22]:
stemming.stem('Congratulations')

'congratul'
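Even when a stem is not a real word, it can still serve as a key for grouping related word forms. The following sketch groups the same word list by its Porter stem; the `groups` dictionary is an illustrative name, not part of NLTK.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

words = ['eating', 'eats', 'eaten', 'writing', 'writes',
         'programming', 'programs', 'history', 'finally', 'finalized']

stemmer = PorterStemmer()

# Map each stem to the list of original words that reduce to it.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for stem, members in groups.items():
    print(f"{stem}: {members}")
```

The grouping shows both sides: 'eating' and 'eats' share the stem 'eat', while 'eaten' is left in a group of its own.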

### Regex Stemmer Class

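With `RegexpStemmer` we can define our own stemming rule: it removes any part of the word that matches the regular expression we supply, and `min` sets the minimum length a word must have to be stemmed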

In [23]:
from nltk.stem import RegexpStemmer

In [25]:
st = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [26]:
st.stem('eating')

'eat'

In [27]:
st.stem('ingeating')

'ingeat'
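The two results above follow from the `$` anchors: only a suffix at the end of the word is removed, so the leading `ing` in 'ingeating' survives. Here is a simplified standard-library sketch of the idea, assuming the stemmer deletes the matched text and leaves words shorter than the minimum untouched. The function `regex_stem` and its parameter `min_len` are hypothetical names, not NLTK's API.

```python
import re

def regex_stem(word, pattern=r"ing$|s$|e$|able$", min_len=4):
    # Hypothetical helper: words shorter than min_len are returned unchanged;
    # otherwise whatever the pattern matches is deleted.
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)

for w in ["eating", "ingeating", "bus"]:
    print(w, "->", regex_stem(w))
```

'bus' has only three characters, so it falls below the minimum length and keeps its final 's'.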

### Snowball Stemmer

In [28]:
from nltk.stem import SnowballStemmer

In [30]:
snowballstemmer = SnowballStemmer(language='english')

In [36]:
for word in words:
    print(word + ' :'+snowballstemmer.stem(word))

eating :eat
eats :eat
eaten :eaten
writing :write
writes :write
programming :program
programs :program
history :histori
finally :final
finalized :final


In [37]:
stemming.stem('sportingly'), stemming.stem('fairly')

('sportingli', 'fairli')

In [39]:
snowballstemmer.stem('sportingly'), snowballstemmer.stem('fairly')

('sport', 'fair')

In [40]:
snowballstemmer.stem('goes')

'goe'
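To make the difference between the two stemmers easier to see, the comparison above can be collected side by side. This is a small sketch; the `comparison` dictionary is an illustrative name.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")

# For each word, store (Porter stem, Snowball stem).
comparison = {w: (porter.stem(w), snowball.stem(w))
              for w in ["sportingly", "fairly"]}

for word, (p, s) in comparison.items():
    print(f"{word:<12} porter={p:<12} snowball={s}")
```

Snowball handles these `-ly` words better than Porter, but it is not perfect either: as shown above, 'goes' still becomes 'goe'.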