### NLP

- Basic Terminologies
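1. Corpus => the whole collection of text (here, a paragraph)
2. Document => one unit of the corpus (here, a sentence)
3. Word => a single token within a document
4. Vocabulary => the set of unique words in the corpus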
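
The four terms can be made concrete with a naive, standard-library-only sketch. The `toy_corpus` string and the splitting rules (periods for sentences, whitespace for words) are assumptions for illustration; the NLTK tokenizers used later are more robust.

```python
# Naive illustration of corpus -> documents -> words -> vocabulary.
# toy_corpus and the splitting rules are illustrative assumptions.
toy_corpus = "NLTK is a library. NLTK is a tool."

# Documents: split the corpus on periods and drop empty pieces.
toy_documents = [s.strip() for s in toy_corpus.split(".") if s.strip()]

# Words: split each document on whitespace.
toy_words = [w for doc in toy_documents for w in doc.split()]

# Vocabulary: the unique words, sorted for stable display.
toy_vocabulary = sorted(set(toy_words))

print(toy_documents)
print(toy_words)
print(toy_vocabulary)
```

The corpus holds eight words, but the vocabulary has only five entries because repeated words are counted once.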

In [47]:
sample_text = """
Apple Inc. is planning to open a new office in Bangalore, India by next year. 
Tim Cook, the CEO of Apple, announced this during a press conference on September 5, 2025. 
The company aims to hire more than 5,000 engineers specializing in AI and machine learning. 
Meanwhile, Google and Microsoft are also expanding their operations in India.
"""


In [19]:
import nltk

In [20]:
corpus = """My name is Prathamesh Gokulkar. I am learning Natural Language Processing using Python's NLTK library. NLTK is a powerful tool for working with human language data (text). It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
"""

In [21]:
corpus

"My name is Prathamesh Gokulkar. I am learning Natural Language Processing using Python's NLTK library. NLTK is a powerful tool for working with human language data (text). It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.\n"

## Tokenization

Tokenization is the process of converting a sequence of text into smaller parts, known as tokens.  

Tokens can be as small as characters or as large as words or whole sentences. Tokenization matters because it breaks human language into small units that are easier for a machine to analyze.
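
As a rough illustration of token granularity, the same text can be split at the character level or at the word level using only built-ins. The `snippet` string is an assumption for illustration, and whitespace splitting is a much cruder rule than NLTK's tokenizers.

```python
# Two granularities of tokens for the same text, using only built-ins.
snippet = "I am learning NLP"

char_tokens = list(snippet.replace(" ", ""))  # character-level tokens, spaces removed
word_tokens = snippet.split()                 # whitespace-delimited word tokens

print(char_tokens)
print(word_tokens)
```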

In [22]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [23]:
# Sentence Tokenization
documents = sent_tokenize(corpus)
documents

['My name is Prathamesh Gokulkar.',
 "I am learning Natural Language Processing using Python's NLTK library.",
 'NLTK is a powerful tool for working with human language data (text).',
 'It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.']

In [24]:
words = word_tokenize(corpus)
words

['My',
 'name',
 'is',
 'Prathamesh',
 'Gokulkar',
 '.',
 'I',
 'am',
 'learning',
 'Natural',
 'Language',
 'Processing',
 'using',
 'Python',
 "'s",
 'NLTK',
 'library',
 '.',
 'NLTK',
 'is',
 'a',
 'powerful',
 'tool',
 'for',
 'working',
 'with',
 'human',
 'language',
 'data',
 '(',
 'text',
 ')',
 '.',
 'It',
 'provides',
 'easy-to-use',
 'interfaces',
 'to',
 'over',
 '50',
 'corpora',
 'and',
 'lexical',
 'resources',
 'such',
 'as',
 'WordNet',
 ',',
 'along',
 'with',
 'a',
 'suite',
 'of',
 'text',
 'processing',
 'libraries',
 'for',
 'classification',
 ',',
 'tokenization',
 ',',
 'stemming',
 ',',
 'tagging',
 ',',
 'parsing',
 ',',
 'and',
 'semantic',
 'reasoning',
 '.']

In [25]:
from nltk.tokenize import wordpunct_tokenize

# WordPunct Tokenization
wordpunct_tokenize(corpus)

['My',
 'name',
 'is',
 'Prathamesh',
 'Gokulkar',
 '.',
 'I',
 'am',
 'learning',
 'Natural',
 'Language',
 'Processing',
 'using',
 'Python',
 "'",
 's',
 'NLTK',
 'library',
 '.',
 'NLTK',
 'is',
 'a',
 'powerful',
 'tool',
 'for',
 'working',
 'with',
 'human',
 'language',
 'data',
 '(',
 'text',
 ').',
 'It',
 'provides',
 'easy',
 '-',
 'to',
 '-',
 'use',
 'interfaces',
 'to',
 'over',
 '50',
 'corpora',
 'and',
 'lexical',
 'resources',
 'such',
 'as',
 'WordNet',
 ',',
 'along',
 'with',
 'a',
 'suite',
 'of',
 'text',
 'processing',
 'libraries',
 'for',
 'classification',
 ',',
 'tokenization',
 ',',
 'stemming',
 ',',
 'tagging',
 ',',
 'parsing',
 ',',
 'and',
 'semantic',
 'reasoning',
 '.']
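
The difference from `word_tokenize` (for example `'s` versus `'`, `s`, and `easy-to-use` versus five separate tokens) comes from `wordpunct_tokenize` being purely regex-based. A minimal sketch, assuming the pattern below matches the one NLTK's `WordPunctTokenizer` uses; `regex_wordpunct` is a hypothetical helper name, not an NLTK function.

```python
import re

# Runs of word characters, or runs of non-word, non-space characters.
# Assumed to mirror the pattern behind NLTK's WordPunctTokenizer.
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

def regex_wordpunct(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return re.findall(WORDPUNCT_PATTERN, text)

print(regex_wordpunct("Python's easy-to-use (text)."))
```

Adjacent punctuation such as `).` stays together as one token because it forms a single run of non-word characters.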

## Stemming
Stemming is a text-processing technique that reduces words to a base or root form by stripping affixes (mostly suffixes) with heuristic rules.  
This standardizes different forms of the same word, which can improve the efficiency and effectiveness of various natural language processing (NLP) tasks.

### PorterStemmer
The stemmed output is not guaranteed to be a meaningful word; for example, `easily` becomes `easili`.

In [26]:
from nltk.stem import PorterStemmer

words = ["running", "runner", "runs", "easily", "fairly", "fishing", "fished", "happiness"]

In [27]:
stemming = PorterStemmer()

for word in words:
    print(f"{word} --> {stemming.stem(word)}")

running --> run
runner --> runner
runs --> run
easily --> easili
fairly --> fairli
fishing --> fish
fished --> fish
happiness --> happi


### RegexpStemmer

In [28]:
from nltk.stem import RegexpStemmer

In [29]:
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

# Use the regex stemmer defined above (not the Porter stemmer)
for word in words:
    print(f"{word} --> {reg_stemmer.stem(word)}")

running --> runn
runner --> runner
runs --> run
easily --> easily
fairly --> fairly
fishing --> fish
fished --> fished
happiness --> happines
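
A regex-based stemmer simply deletes whatever the pattern matches at the end of the word. The following is a minimal standard-library sketch of that idea, assuming the same pattern and minimum length as the cell above; `regex_stem` is a hypothetical helper, not an NLTK function.

```python
import re

def regex_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Delete a matching suffix; leave words shorter than min_length untouched."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

for w in ["running", "runs", "fished", "happiness", "bus"]:
    print(f"{w} --> {regex_stem(w)}")
```

Because the pattern knows nothing about English spelling rules, `running` keeps its doubled consonant and `fished` is left alone, since `ed$` is not part of the pattern.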


### Snowball Stemmer
The Snowball stemmer is an enhanced version of the Porter stemmer (its English algorithm is also known as Porter2).  
One of its key advantages is that it supports multiple languages, making it a multilingual stemmer.

In [30]:
from nltk.stem import SnowballStemmer

snow_stemmer = SnowballStemmer(language='english')

for word in words:
    print(f"{word} --> {snow_stemmer.stem(word)}")  

running --> run
runner --> runner
runs --> run
easily --> easili
fairly --> fair
fishing --> fish
fished --> fish
happiness --> happi
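
To see where the two stemmers diverge, they can be run side by side. This sketch assumes NLTK is installed (no extra data downloads are needed for these stemmers); `sample_words` is a subset of the list used above.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Print only the words on which the two stemmers disagree.
sample_words = ["running", "easily", "fairly", "happiness"]
for w in sample_words:
    p, s = porter.stem(w), snowball.stem(w)
    if p != s:
        print(f"{w}: porter={p}, snowball={s}")

# Languages supported by NLTK's Snowball implementation.
print(SnowballStemmer.languages)
```

On this sample only `fairly` differs, matching the outputs shown in the cells above.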


## Lemmatization
Lemmatization reduces a word to its dictionary form (lemma). Unlike stemming, which simply strips affixes, it considers the word's meaning and part of speech (POS) and ensures that the base form is a valid word.
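
The role of the part of speech can be sketched with a toy lookup table. Everything here (`TOY_LEMMAS`, `toy_lemmatize`, and the three entries) is hypothetical and only illustrates the idea; a real lemmatizer consults a full lexicon such as WordNet.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech).
# Real lemmatizers use a full lexicon; this is only a toy.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos) if known, else the word unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("running", pos="v"))  # looked up as a verb
print(toy_lemmatize("running", pos="n"))  # not in the table as a noun
print(toy_lemmatize("better", pos="a"))   # irregular form, no suffix to strip
```

The same surface form can map to different lemmas depending on the POS passed in, which is why the POS argument matters.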

### WordNet Lemmatization

In [31]:
from nltk.stem import WordNetLemmatizer

In [32]:
lemmatizer = WordNetLemmatizer()

# pos = 'v' for verb
# pos = 'n' for noun (the default)
# pos = 'a' for adjective
# pos = 'r' for adverb
for word in words:
    print(f"{word} --> {lemmatizer.lemmatize(word, pos='v')}")

running --> run
runner --> runner
runs --> run
easily --> easily
fairly --> fairly
fishing --> fish
fished --> fish
happiness --> happiness


## Stopword removal

In [46]:
from nltk.corpus import stopwords

stopword = stopwords.words('english')
stopword

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [42]:
tokenized_corpus = word_tokenize(corpus)

In [45]:
lemmas = [lemmatizer.lemmatize(word) for word in tokenized_corpus if word not in stopword]

print(lemmas)

['My', 'name', 'Prathamesh', 'Gokulkar', '.', 'I', 'learning', 'Natural', 'Language', 'Processing', 'using', 'Python', "'s", 'NLTK', 'library', '.', 'NLTK', 'powerful', 'tool', 'working', 'human', 'language', 'data', '(', 'text', ')', '.', 'It', 'provides', 'easy-to-use', 'interface', '50', 'corpus', 'lexical', 'resource', 'WordNet', ',', 'along', 'suite', 'text', 'processing', 'library', 'classification', ',', 'tokenization', ',', 'stemming', ',', 'tagging', ',', 'parsing', ',', 'semantic', 'reasoning', '.']
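
In the output above, `My`, `I`, and `It` survive because the stopword list is lowercase and the membership test is case-sensitive; punctuation tokens also remain. A minimal sketch of a stricter filter, using a small hand-picked `toy_stopwords` set and `tokens` list (both assumptions for illustration, not NLTK's data):

```python
# Small hand-picked stopword set for illustration; NLTK's English list is longer.
toy_stopwords = {"my", "i", "is", "am", "it", "a", "to", "and"}

tokens = ["My", "name", "is", "Prathamesh", ".", "I", "am", "learning", "NLP", "."]

# Case-sensitive check: capitalized stopwords slip through.
case_sensitive = [t for t in tokens if t not in toy_stopwords]

# Case-insensitive check that also drops tokens that are not purely
# alphanumeric (such as punctuation).
cleaned = [t for t in tokens if t.lower() not in toy_stopwords and t.isalnum()]

print(case_sensitive)
print(cleaned)
```

Note that `str.isalnum()` would also drop hyphenated tokens such as `easy-to-use`, so whether to use it depends on the task.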


## Part of Speech Tagging

In [None]:
import nltk
from nltk import pos_tag

In [36]:
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

text = "The quick brown fox jumps over the lazy dog"

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\LOQ\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LOQ\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\LOQ\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


In [38]:
text = word_tokenize(text)
print(text)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']


In [39]:
pos_tags = pos_tag(text)
pos_tags

[('The', 'DT'),
 ('quick', 'JJ'),
 ('brown', 'NN'),
 ('fox', 'NN'),
 ('jumps', 'VBZ'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'NN')]
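
The Penn Treebank tags returned by `pos_tag` do not match the single-letter `pos` values the WordNet lemmatizer expects. One common workaround is a small mapping function; `penn_to_wordnet` below is a hypothetical helper, and falling back to noun for unmapped tags is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('n' as fallback)."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun and everything else

# A few (word, tag) pairs copied from the output above.
tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

The resulting letter can be passed as the `pos` argument of `lemmatizer.lemmatize` so that each word is lemmatized according to its tagged role.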

## NER

Named entity recognition (NER) with `ne_chunk` requires POS-tagged text.

In [55]:
from nltk import ne_chunk

# Necessary downloads for NER

nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\LOQ\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\LOQ\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [57]:
sentences = sent_tokenize(sample_text)

for sentence in sentences:
    words = word_tokenize(sentence)
    pos_tags = pos_tag(words)
    named_entities = ne_chunk(pos_tags)
    print(named_entities)

(S
  (PERSON Apple/NNP)
  (ORGANIZATION Inc./NNP)
  is/VBZ
  planning/VBG
  to/TO
  open/VB
  a/DT
  new/JJ
  office/NN
  in/IN
  (GPE Bangalore/NNP)
  ,/,
  (GPE India/NNP)
  by/IN
  next/JJ
  year/NN
  ./.)
(S
  (PERSON Tim/NNP)
  (GPE Cook/NNP)
  ,/,
  the/DT
  (ORGANIZATION CEO/NNP)
  of/IN
  (GPE Apple/NNP)
  ,/,
  announced/VBD
  this/DT
  during/IN
  a/DT
  press/NN
  conference/NN
  on/IN
  September/NNP
  5/CD
  ,/,
  2025/CD
  ./.)
(S
  The/DT
  company/NN
  aims/VBZ
  to/TO
  hire/VB
  more/JJR
  than/IN
  5,000/CD
  engineers/NNS
  specializing/VBG
  in/IN
  AI/NNP
  and/CC
  machine/NN
  learning/NN
  ./.)
(S
  Meanwhile/RB
  ,/,
  (PERSON Google/NNP)
  and/CC
  (ORGANIZATION Microsoft/NNP)
  are/VBP
  also/RB
  expanding/VBG
  their/PRP$
  operations/NNS
  in/IN
  (GPE India/NNP)
  ./.)
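
The printed trees are hard to scan, so it can help to pull out just the labeled chunks. A minimal sketch, assuming NLTK is installed: the tree is built by hand to mirror part of the output above (labels copied as-is, including questionable ones such as `Cook` tagged as `GPE`), and `extract_entities` is a hypothetical helper.

```python
from nltk.tree import Tree

# Hand-built tree mirroring part of the ne_chunk output shown above.
tree = Tree("S", [
    Tree("PERSON", [("Tim", "NNP")]),
    Tree("GPE", [("Cook", "NNP")]),
    (",", ","),
    ("announced", "VBD"),
    Tree("GPE", [("India", "NNP")]),
])

def extract_entities(chunked):
    """Collect (entity text, label) pairs from the labeled subtrees."""
    entities = []
    for node in chunked:
        # Named entities are nested Tree objects; plain tokens are tuples.
        if isinstance(node, Tree):
            text = " ".join(token for token, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities

print(extract_entities(tree))
```

The same function can be applied to each `named_entities` tree produced in the loop above.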
