# NLP

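- Basic Terminologies
1. Corpus => the full body of text being analyzed (here, a paragraph)
2. Document => one unit of the corpus (here, a sentence)
3. Word => a single token within a document
4. Vocabulary => the set of unique words in the corpus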
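
The four terms can be illustrated with the standard library alone. This is a rough sketch, not NLTK's tokenizers: the sample text, the naive sentence split, and the `\w+` word pattern are all assumptions chosen to keep the example small.

```python
import re

# Corpus: the whole body of text (a short made-up paragraph here).
corpus = "NLTK is a toolkit. NLTK is written in Python."

# Documents: the units the corpus is divided into (sentences here).
# Naive split on whitespace that follows sentence-ending punctuation.
documents = re.split(r"(?<=[.!?])\s+", corpus.strip())

# Words: the individual tokens of every document, lowercased.
words = [w.lower() for doc in documents for w in re.findall(r"\w+", doc)]

# Vocabulary: the unique words in the corpus.
vocabulary = sorted(set(words))

print(documents)
print(len(words), "words,", len(vocabulary), "unique")
print(vocabulary)
```

The corpus has 9 words but only 7 vocabulary entries, because "nltk" and "is" each occur twice.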

In [3]:
import nltk

In [4]:
corpus = """My name is Prathamesh Gokulkar. I am learning Natural Language Processing using Python's NLTK library. NLTK is a powerful tool for working with human language data (text). It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
"""

In [5]:
corpus

"My name is Prathamesh Gokulkar. I am learning Natural Language Processing using Python's NLTK library. NLTK is a powerful tool for working with human language data (text). It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.\n"

## Tokenization

Tokenization is the process of splitting a sequence of text into smaller parts, known as tokens.  

Tokens can be as small as single characters or as large as whole words or sentences. Tokenization matters because it breaks human language into bite-sized pieces that a program can count, compare, and analyze.
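
As a quick illustration of token granularity, the same string can be cut at the word level or at the character level. This sketch uses only built-in string methods; the sample text and the whitespace-splitting rule are assumptions made for the example, and real tokenizers also handle punctuation.

```python
text = "NLTK splits text"

# Word-level tokens: split on whitespace (a deliberately simple rule).
word_tokens = text.split()

# Character-level tokens: every non-space character is its own token.
char_tokens = [ch for ch in text if not ch.isspace()]

print(word_tokens)        # ['NLTK', 'splits', 'text']
print(char_tokens[:4])    # ['N', 'L', 'T', 'K']
print(len(word_tokens), len(char_tokens))  # 3 14
```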

In [6]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [8]:
# Sentence Tokenization
documents = sent_tokenize(corpus)
documents

['My name is Prathamesh Gokulkar.',
 "I am learning Natural Language Processing using Python's NLTK library.",
 'NLTK is a powerful tool for working with human language data (text).',
 'It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.']

In [9]:
# Word Tokenization
words = word_tokenize(corpus)
words

['My',
 'name',
 'is',
 'Prathamesh',
 'Gokulkar',
 '.',
 'I',
 'am',
 'learning',
 'Natural',
 'Language',
 'Processing',
 'using',
 'Python',
 "'s",
 'NLTK',
 'library',
 '.',
 'NLTK',
 'is',
 'a',
 'powerful',
 'tool',
 'for',
 'working',
 'with',
 'human',
 'language',
 'data',
 '(',
 'text',
 ')',
 '.',
 'It',
 'provides',
 'easy-to-use',
 'interfaces',
 'to',
 'over',
 '50',
 'corpora',
 'and',
 'lexical',
 'resources',
 'such',
 'as',
 'WordNet',
 ',',
 'along',
 'with',
 'a',
 'suite',
 'of',
 'text',
 'processing',
 'libraries',
 'for',
 'classification',
 ',',
 'tokenization',
 ',',
 'stemming',
 ',',
 'tagging',
 ',',
 'parsing',
 ',',
 'and',
 'semantic',
 'reasoning',
 '.']

In [10]:
from nltk.tokenize import wordpunct_tokenize

# WordPunct Tokenization
wordpunct_tokenize(corpus)

['My',
 'name',
 'is',
 'Prathamesh',
 'Gokulkar',
 '.',
 'I',
 'am',
 'learning',
 'Natural',
 'Language',
 'Processing',
 'using',
 'Python',
 "'",
 's',
 'NLTK',
 'library',
 '.',
 'NLTK',
 'is',
 'a',
 'powerful',
 'tool',
 'for',
 'working',
 'with',
 'human',
 'language',
 'data',
 '(',
 'text',
 ').',
 'It',
 'provides',
 'easy',
 '-',
 'to',
 '-',
 'use',
 'interfaces',
 'to',
 'over',
 '50',
 'corpora',
 'and',
 'lexical',
 'resources',
 'such',
 'as',
 'WordNet',
 ',',
 'along',
 'with',
 'a',
 'suite',
 'of',
 'text',
 'processing',
 'libraries',
 'for',
 'classification',
 ',',
 'tokenization',
 ',',
 'stemming',
 ',',
 'tagging',
 ',',
 'parsing',
 ',',
 'and',
 'semantic',
 'reasoning',
 '.']
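
The two outputs differ because `wordpunct_tokenize` is purely pattern-based. To my knowledge it is built on the regular expression `\w+|[^\w\s]+`, so the behaviour can be approximated with the standard `re` module. The sample string and the `WORDPUNCT_PATTERN` name below are assumptions for illustration.

```python
import re

# A token is either a run of word characters, or a run of characters that
# are neither word characters nor whitespace (that is, punctuation).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

sample = "Python's easy-to-use (text)."
tokens = WORDPUNCT_PATTERN.findall(sample)

print(tokens)
# ['Python', "'", 's', 'easy', '-', 'to', '-', 'use', '(', 'text', ').']
```

This is why `Python's` becomes three tokens, `easy-to-use` becomes five, and `).` stays a single token, whereas `word_tokenize` keeps `'s` and `easy-to-use` together.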

## Stemming
Stemming is a text-processing technique that reduces words to a base or root form by stripping affixes, in practice mostly suffixes.  
Mapping inflected forms such as "runs" and "running" to one stem shrinks the vocabulary, which improves the efficiency and effectiveness of many natural language processing (NLP) tasks.
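
To make the idea concrete, here is a toy suffix stripper. It is a hypothetical illustration, not the Porter algorithm: the suffix list, the `naive_stem` name, and the `min_stem` threshold are all assumptions chosen for the example.

```python
# Suffixes to strip, checked in order (longest first). Illustrative only.
SUFFIXES = ("ness", "ing", "ed", "ly", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

tokens = ["fishing", "fished", "fish", "runs", "run", "happiness"]
stems = [naive_stem(t) for t in tokens]

print(stems)
print(len(set(tokens)), "distinct tokens ->", len(set(stems)), "distinct stems")
```

Six distinct tokens collapse to three stems. One of them, "happi", is not a real word, which is the usual trade-off of rule-based stemming.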

### PorterStemmer
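The Porter stemmer applies a fixed sequence of suffix-stripping rules, so the stemmed output is not guaranteed to be a meaningful word (for example, "easily" becomes "easili").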

In [30]:
from nltk.stem import PorterStemmer

words = ["running", "runner", "runs", "easily", "fairly", "fishing", "fished", "happiness"]

In [None]:
stemming = PorterStemmer()

for word in words:
    print(f"{word} --> {stemming.stem(word)}")

running --> run
runner --> runner
runs --> run
easily --> easili
fairly --> fairli
fishing --> fish
fished --> fish
happiness --> happi


### RegexpStemmer

In [11]:
from nltk.stem import RegexpStemmer

In [22]:
# Strip a trailing "ing", "s", "e" or "able"; words shorter than 4 characters are left alone
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

for word in words:
    print(f"{word} --> {reg_stemmer.stem(word)}")

running --> runn
runner --> runner
runs --> run
easily --> easily
fairly --> fairly
fishing --> fish
fished --> fished
happiness --> happines
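
`RegexpStemmer` simply deletes whatever the pattern matches and skips words shorter than `min`. The following is a plain-`re` sketch of that rule as I understand it, not NLTK's own code; `regexp_stem` and `min_length` are hypothetical names, and the word list is made up for the example.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Delete every match of pattern, unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

pattern = r"ing$|s$|e$|able$"
sample_words = ["running", "runs", "gas", "fished", "table"]
results = {w: regexp_stem(w, pattern, min_length=4) for w in sample_words}

for word, stem in results.items():
    print(f"{word} --> {stem}")
```

Here "gas" is untouched because it is shorter than 4 characters, "fished" is untouched because the pattern has no `ed$` alternative, and "table" is cut down to "t", which shows how blunt a pure regex stemmer can be.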


### Snowball Stemmer
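The Snowball stemmer (its English variant is also known as Porter2) is an enhanced version of the Porter stemmer. For example, it stems "fairly" to "fair", where Porter produces "fairli".  
A key advantage is that it supports multiple languages, making it a multilingual stemmer.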

In [23]:
from nltk.stem import SnowballStemmer

snow_stemmer = SnowballStemmer(language='english')

for word in words:
    print(f"{word} --> {snow_stemmer.stem(word)}")  

running --> run
runner --> runner
runs --> run
easily --> easili
fairly --> fair
fishing --> fish
fished --> fish
happiness --> happi


## Lemmatization
Lemmatization reduces a word to its dictionary form, called the lemma. Unlike stemming, which simply strips prefixes or suffixes, lemmatization considers the word's meaning and part of speech (POS) and ensures that the base form is a valid word.
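
The role of the part of speech can be shown with a toy lookup table. This is a hypothetical sketch, not how WordNet works internally: the `LEMMAS` table and `toy_lemmatize` function are hand-written assumptions that only demonstrate that the same word can have different lemmas under different POS tags.

```python
# Hand-written (word, pos) -> lemma table. Purely illustrative; a real
# lemmatizer consults a lexical database such as WordNet.
LEMMAS = {
    ("running", "v"): "run",
    ("running", "n"): "running",
    ("better", "a"): "good",
    ("better", "r"): "well",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("running", "v"))    # run
print(toy_lemmatize("running", "n"))    # running
print(toy_lemmatize("better", "a"))     # good
print(toy_lemmatize("happiness", "v"))  # happiness (not in the table)
```

Every result is a valid word, and an unknown word is returned unchanged rather than being chopped.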

### WordNet Lemmatization

In [25]:
from nltk.stem import WordNetLemmatizer

In [29]:
lemmatizer = WordNetLemmatizer()

# pos selects the part of speech used for the lookup:
#   'v' = verb, 'n' = noun (the default), 'a' = adjective, 'r' = adverb
for word in words:
    print(f"{word} --> {lemmatizer.lemmatize(word, pos='v')}")

running --> run
runner --> runner
runs --> run
easily --> easily
fairly --> fairly
fishing --> fish
fished --> fish
happiness --> happiness
