# NLP - Pre-processing

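Corpus -> the full body of text being analyzed (e.g. a paragraph)

Documents -> the individual units within the corpus (e.g. sentences)

Vocabulary -> the set of unique words in the corpus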
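
As a rough illustration of how these three terms relate, the sketch below builds a vocabulary from a tiny corpus. The sample text and the naive splitting on periods and whitespace are assumptions made for brevity; they are not part of the original notebook.

```python
# A toy corpus: one paragraph made of two documents (sentences).
corpus = "NLP is fun. NLP is useful."

# Documents: naive split on the period (illustration only).
documents = [doc.strip() for doc in corpus.split(".") if doc.strip()]

# Vocabulary: the unique lowercase words across all documents.
vocabulary = sorted({word.lower() for doc in documents for word in doc.split()})

print(documents)
print(vocabulary)
```

The corpus holds six words in total, but the vocabulary holds only four, because repeated words are counted once.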

## Tokenization

Tokenization is the process of breaking text into smaller, manageable units called `tokens`, such as sentences, words, or punctuation marks.

In [1]:
import nltk

from nltk.tokenize import (
    sent_tokenize,
    word_tokenize,
    wordpunct_tokenize,
    TreebankWordTokenizer
)

In [2]:
text = """There are multiple ways we can perform cost's tokenization on given text data.
We can choose any method based on language, library and purpose of modeling.
"""

### sent_tokenize

In [3]:
sentences = sent_tokenize(text=text, language='english')
sentences

["There are multiple ways we can perform cost's tokenization on given text data.",
 'We can choose any method based on language, library and purpose of modeling.']
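
`sent_tokenize` relies on a pre-trained Punkt model rather than a fixed rule. To see why a fixed rule falls short, here is a minimal sketch of a naive splitter. The function name `naive_sent_split` and the sample strings are hypothetical and exist only for this illustration.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "We can tokenize text. We can choose any method."
print(naive_sent_split(sample))

# The rule has no notion of abbreviations, so it over-splits here.
print(naive_sent_split("Dr. Smith arrived. He sat."))
```

The second call breaks after `Dr.`, which is the kind of case a trained sentence tokenizer is designed to handle.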

### word_tokenize

In [4]:
words = word_tokenize(text=text, language='english')
words

['There',
 'are',
 'multiple',
 'ways',
 'we',
 'can',
 'perform',
 'cost',
 "'s",
 'tokenization',
 'on',
 'given',
 'text',
 'data',
 '.',
 'We',
 'can',
 'choose',
 'any',
 'method',
 'based',
 'on',
 'language',
 ',',
 'library',
 'and',
 'purpose',
 'of',
 'modeling',
 '.']

### wordpunct_tokenize

In [5]:
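# 'punkt' provides the models used by sent_tokenize and word_tokenize above;
# in a fresh environment, run this download before those cells.
nltk.download('punkt')

# wordpunct_tokenize is purely regex-based and needs no downloaded data.
wordpunct = wordpunct_tokenize(text=text)
wordpunct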

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['There',
 'are',
 'multiple',
 'ways',
 'we',
 'can',
 'perform',
 'cost',
 "'",
 's',
 'tokenization',
 'on',
 'given',
 'text',
 'data',
 '.',
 'We',
 'can',
 'choose',
 'any',
 'method',
 'based',
 'on',
 'language',
 ',',
 'library',
 'and',
 'purpose',
 'of',
 'modeling',
 '.']
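
The split of `cost's` into `cost`, `'`, and `s` follows from the regex-based approach: every run of word characters and every run of punctuation becomes its own token. The sketch below reproduces that behavior with the standard `re` module. The pattern is, to my understanding, the one NLTK uses for this tokenizer, but treat it as an approximation; `simple_wordpunct` is a hypothetical helper name.

```python
import re

# Runs of word characters, or runs of non-word, non-space characters.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_wordpunct(text):
    """Return all word and punctuation runs, in order of appearance."""
    return WORDPUNCT_PATTERN.findall(text)

tokens = simple_wordpunct("perform cost's tokenization.")
print(tokens)
```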

### TreebankWordTokenizer

In [6]:
treebank = TreebankWordTokenizer().tokenize(text)
treebank

['There',
 'are',
 'multiple',
 'ways',
 'we',
 'can',
 'perform',
 'cost',
 "'s",
 'tokenization',
 'on',
 'given',
 'text',
 'data.',
 'We',
 'can',
 'choose',
 'any',
 'method',
 'based',
 'on',
 'language',
 ',',
 'library',
 'and',
 'purpose',
 'of',
 'modeling',
 '.']

## Stemming

Stemming reduces a word to its base form, known as the `stem`, by stripping affixes with heuristic rules. The `stem` may **not** be a valid word in itself (e.g. `histori` for `history`), but it serves as the base to which prefixes and suffixes are attached.

In [7]:
from nltk.stem import (
    PorterStemmer,
    SnowballStemmer,
    RegexpStemmer
)

In [8]:
words = ["fairly", "goes", "ingesting", "eating", "eats", "eaten", "writing",
         "writes", "programming", "programs", "history", "finally", "finalized", "sportingly"]

### PorterStemmer

In [9]:
porter = PorterStemmer()

for word in words:
    print(f"{word:12} ----> {porter.stem(word)}")

fairly       ----> fairli
goes         ----> goe
ingesting    ----> ingest
eating       ----> eat
eats         ----> eat
eaten        ----> eaten
writing      ----> write
writes       ----> write
programming  ----> program
programs     ----> program
history      ----> histori
finally      ----> final
finalized    ----> final
sportingly   ----> sportingli


### SnowballStemmer

In [10]:
snowball = SnowballStemmer(language='english', ignore_stopwords=False)

for word in words:
    print(f"{word:12} ----> {snowball.stem(word)}")

fairly       ----> fair
goes         ----> goe
ingesting    ----> ingest
eating       ----> eat
eats         ----> eat
eaten        ----> eaten
writing      ----> write
writes       ----> write
programming  ----> program
programs     ----> program
history      ----> histori
finally      ----> final
finalized    ----> final
sportingly   ----> sport


### RegexpStemmer

In [11]:
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)

for word in words:
    print(f"{word:12} ----> {regexp.stem(word)}")

fairly       ----> fairly
goes         ----> goe
ingesting    ----> ingest
eating       ----> eat
eats         ----> eat
eaten        ----> eaten
writing      ----> writ
writes       ----> write
programming  ----> programm
programs     ----> program
history      ----> history
finally      ----> finally
finalized    ----> finalized
sportingly   ----> sportingly
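
The results above can be reproduced with a few lines of plain Python: words shorter than the minimum length are returned unchanged, and otherwise the matching suffix is deleted. This is a minimal sketch of that idea, not NLTK's implementation; `simple_regexp_stem` and `min_length` are hypothetical names.

```python
import re

def simple_regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Strip a matching suffix unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

for w in ["writing", "programming", "eats", "is", "fairly"]:
    print(f"{w:12} ----> {simple_regexp_stem(w)}")
```

Because the pattern only removes the listed suffixes, `writing` becomes `writ` and `programming` becomes `programm`, while words ending in `ly` or `ed` pass through untouched.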


## Lemmatization

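Lemmatization is the process of reducing a word to its base or dictionary form, called a `lemma`. Unlike a stem, a lemma is normally a valid word, and the result depends on the word's part of speech.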

In [12]:
from nltk.stem import (
    WordNetLemmatizer
)

In [13]:
words = ["fairly", "goes", "ingesting", "eating", "eats", "eaten", "writing",
         "writes", "programming", "programs", "history", "finally", "finalized", "sportingly"]

### WordNetLemmatizer

In [14]:
nltk.download('wordnet')

wordnet = WordNetLemmatizer()

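# pos='v' treats every word as a verb, so non-verbs such as
# "fairly" or "history" are returned unchanged.
for word in words:
    print(f"{word:12} ----> {wordnet.lemmatize(word=word, pos='v')}")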

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


fairly       ----> fairly
goes         ----> go
ingesting    ----> ingest
eating       ----> eat
eats         ----> eat
eaten        ----> eat
writing      ----> write
writes       ----> write
programming  ----> program
programs     ----> program
history      ----> history
finally      ----> finally
finalized    ----> finalize
sportingly   ----> sportingly
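
In the output above only the verb forms change, because lemmatization is a dictionary lookup that depends on the part of speech rather than a suffix rule. The toy sketch below illustrates that idea with a hand-written table. The table contents and the name `toy_lemmatize` are hypothetical; a real lemmatizer such as WordNet's uses a far larger lexicon plus morphological rules.

```python
# Toy lookup table keyed by (word, part of speech); hypothetical data.
LEMMA_TABLE = {
    ("goes", "v"): "go",
    ("eaten", "v"): "eat",
    ("programs", "n"): "program",
    ("programs", "v"): "program",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("eaten", pos="v"))   # found as a verb
print(toy_lemmatize("eaten", pos="n"))   # not listed as a noun, unchanged
print(toy_lemmatize("fairly", pos="v"))  # not in the table, unchanged
```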
