# Natural Language Processing - Basics
Natural Language Processing (NLP) is a branch of artificial intelligence that helps computers understand human languages such as English.

It lets computers:

- Read what we write.
- Hear what we say.
- Understand the meaning.
- Reply in a human-like way.

In [75]:
# Installing Library
!pip install nltk




In [76]:
# Importing Necessary Libraries
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize, TreebankWordTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import wordnet


In [77]:
# Defining a sample corpus
corpus = """In Natural Language Processing (NLP), a corpus is a large collection of written or spoken texts.
It is like a dataset made of language. Researchers and machines use a corpus to learn how people speak and write.
For example, a corpus may contain books, articles, tweets, or even chat messages.
By analyzing this data, these models can learn grammar, vocabulary, sentence structure, and meaning.
"""

In [78]:
# Printing the corpus
print(corpus)


In Natural Language Processing (NLP), a corpus is a large collection of written or spoken texts. 
It is like a dataset made of language. Researchers and machines use a corpus to learn how people speak and write. 
For example, a corpus may contain books, articles, tweets, or even chat messages. 
By analyzing this data, these models can learn grammar, vocabulary, sentence structure, and meaning.



In [79]:
# punkt_tab contains the pre-trained Punkt models that sent_tokenize and word_tokenize use to split text.
import nltk
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## 1. Tokenization
Tokenization means breaking text into smaller units called tokens, such as sentences, words, or punctuation marks.

For example:

- Sentence: "I love coding!"
- After tokenization: ["I", "love", "coding", "!"]
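
Splitting on whitespace alone does not produce this result, because the exclamation mark stays attached to the last word. The sketch below contrasts `str.split()` with a small regular expression. The pattern is an illustrative assumption, not what NLTK uses internally.

```python
import re

sentence = "I love coding!"

# Whitespace splitting leaves punctuation attached to the word
whitespace_tokens = sentence.split()

# Illustrative pattern: a run of word characters, or a single punctuation mark
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)  # ['I', 'love', 'coding!']
print(regex_tokens)       # ['I', 'love', 'coding', '!']
```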

#### 1 - Sentence Tokenization
Sentence tokenization means breaking a paragraph into sentences.

In [80]:
# Performing tokenization
document = sent_tokenize(corpus)

# Checking type
type(document)


list

In [81]:
# Checking each sentence after tokenization using a loop
for sentence in document:
  print(sentence)


In Natural Language Processing (NLP), a corpus is a large collection of written or spoken texts.
It is like a dataset made of language.
Researchers and machines use a corpus to learn how people speak and write.
For example, a corpus may contain books, articles, tweets, or even chat messages.
By analyzing this data, these models can learn grammar, vocabulary, sentence structure, and meaning.
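
To see why a trained sentence splitter is useful, it may help to compare it with a naive rule. The sketch below uses a hypothetical helper, `naive_sent_split`, which splits after every `.`, `!`, or `?` followed by whitespace. It is only an approximation for illustration and is not how `sent_tokenize` works.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Works on simple text
print(naive_sent_split("It is like a dataset. Researchers use it."))
# ['It is like a dataset.', 'Researchers use it.']

# Breaks on abbreviations, because "Dr." looks like the end of a sentence
print(naive_sent_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

The second call shows the kind of mistake a fixed rule makes on abbreviations, which the Punkt models are designed to handle.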


#### 2 - Word Tokenization
Word tokenization means breaking text into individual words and punctuation marks.

In [82]:
# Doing word tokenization
words = word_tokenize(corpus)
type(words)


list

In [83]:
# Displaying each word in corpus
for word in words:
  print(word)


In
Natural
Language
Processing
(
NLP
)
,
a
corpus
is
a
large
collection
of
written
or
spoken
texts
.
It
is
like
a
dataset
made
of
language
.
Researchers
and
machines
use
a
corpus
to
learn
how
people
speak
and
write
.
For
example
,
a
corpus
may
contain
books
,
articles
,
tweets
,
or
even
chat
messages
.
By
analyzing
this
data
,
these
models
can
learn
grammar
,
vocabulary
,
sentence
structure
,
and
meaning
.


In [84]:
# Word tokenization can also be applied one sentence at a time
for sentence in document:
  print(word_tokenize(sentence))


['In', 'Natural', 'Language', 'Processing', '(', 'NLP', ')', ',', 'a', 'corpus', 'is', 'a', 'large', 'collection', 'of', 'written', 'or', 'spoken', 'texts', '.']
['It', 'is', 'like', 'a', 'dataset', 'made', 'of', 'language', '.']
['Researchers', 'and', 'machines', 'use', 'a', 'corpus', 'to', 'learn', 'how', 'people', 'speak', 'and', 'write', '.']
['For', 'example', ',', 'a', 'corpus', 'may', 'contain', 'books', ',', 'articles', ',', 'tweets', ',', 'or', 'even', 'chat', 'messages', '.']
['By', 'analyzing', 'this', 'data', ',', 'these', 'models', 'can', 'learn', 'grammar', ',', 'vocabulary', ',', 'sentence', 'structure', ',', 'and', 'meaning', '.']


#### 3 - Word Punctuation Tokenization
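- Splits text into runs of word characters and runs of punctuation, without any language-specific rules.
- Breaks "don't" into "don", "'", and "t".
- Keeps adjacent punctuation marks together, which is why ")," appears as a single token in the output of this section.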
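
This behaviour can be reproduced with a single regular expression. The sketch below uses the pattern `\w+|[^\w\s]+`, which to my understanding mirrors the one behind `wordpunct_tokenize`. The helper name `wordpunct_sketch` is hypothetical.

```python
import re

# A run of word characters, or a run of characters that are neither
# word characters nor whitespace (that is, punctuation)
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Return every non-overlapping match of the pattern, left to right."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_sketch("don't"))
# ['don', "'", 't']

print(wordpunct_sketch("Processing (NLP), a corpus"))
# ['Processing', '(', 'NLP', '),', 'a', 'corpus']
```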

In [85]:
# Doing the word punctuation tokenization
wordPunctuation = wordpunct_tokenize(corpus)
type(wordPunctuation)


list

In [86]:
# Displaying
for word in wordPunctuation:
  print(word)


In
Natural
Language
Processing
(
NLP
),
a
corpus
is
a
large
collection
of
written
or
spoken
texts
.
It
is
like
a
dataset
made
of
language
.
Researchers
and
machines
use
a
corpus
to
learn
how
people
speak
and
write
.
For
example
,
a
corpus
may
contain
books
,
articles
,
tweets
,
or
even
chat
messages
.
By
analyzing
this
data
,
these
models
can
learn
grammar
,
vocabulary
,
sentence
structure
,
and
meaning
.


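#### 4 - Treebank Word Tokenizer
TreebankWordTokenizer splits text into words using regular-expression rules that follow the Penn Treebank conventions. It separates most punctuation and splits contractions such as "don't" into "do" and "n't". It expects one sentence at a time, so when it is given a whole paragraph only the final period becomes its own token. That is why "texts." and "language." keep their periods in the output of this section.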
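
The sketch below imitates a few Treebank-style conventions with plain regular expressions: setting some punctuation apart, splitting contractions, and detaching only the final period. The function `treebank_like` is a hypothetical, heavily simplified approximation and not the real rule set of `TreebankWordTokenizer`.

```python
import re

def treebank_like(text):
    """Very rough imitation of a few Treebank tokenization conventions."""
    # Put spaces around commas, semicolons, colons and parentheses
    text = re.sub(r"([,;:()])", r" \1 ", text)
    # Detach a period only when it ends the whole string
    text = re.sub(r"\.\s*$", " .", text)
    # Split negative contractions: "don't" -> "do n't"
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Split a few other clitics: "It's" -> "It 's"
    text = re.sub(r"(\w)('s|'re|'ve|'ll|'d|'m)\b", r"\1 \2", text)
    return text.split()

print(treebank_like("They don't agree. It's fine."))
# ['They', 'do', "n't", 'agree.', 'It', "'s", 'fine', '.']
```

Note that "agree." keeps its period because it is not at the end of the string. Tokenizing sentence by sentence avoids this.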

In [87]:
# Using TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

treeB = tokenizer.tokenize(corpus)
type(treeB)

list

In [88]:
# Displaying
for word in treeB:
  print(word)

In
Natural
Language
Processing
(
NLP
)
,
a
corpus
is
a
large
collection
of
written
or
spoken
texts.
It
is
like
a
dataset
made
of
language.
Researchers
and
machines
use
a
corpus
to
learn
how
people
speak
and
write.
For
example
,
a
corpus
may
contain
books
,
articles
,
tweets
,
or
even
chat
messages.
By
analyzing
this
data
,
these
models
can
learn
grammar
,
vocabulary
,
sentence
structure
,
and
meaning
.


## 2. Stemming and Lemmatization
- Stemming means cutting a word down to a base form by stripping suffixes with fixed rules, even if the result is not a real word (for example, "studies" becomes "studi"). It’s fast, but not always accurate.

- Lemmatization means reducing a word to its dictionary form (called a lemma) using a vocabulary and the word's part of speech. It’s slower but more accurate than stemming.
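
The difference between the two ideas can be sketched with toy versions of each. Both helpers below are hypothetical: `toy_stem` applies a handful of made-up suffix rules (it is not the Porter algorithm), and `toy_lemmatize` looks words up in a tiny hand-written table (it is not WordNet).

```python
# Toy stemmer: strip the first matching suffix, no dictionary involved
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters to remain after stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

# Toy lemmatizer: look up (word, part of speech) in a small table
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("flies", "n"): "fly",
    ("children", "n"): "child",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    # Unknown (word, pos) pairs are returned unchanged
    return LEMMA_TABLE.get((word, pos), word)

for w in ["running", "flies", "children"]:
    print(w, "| stem:", toy_stem(w), "| lemma (noun):", toy_lemmatize(w))

# The part of speech changes the result of the lookup
print(toy_lemmatize("running", pos="v"))  # run
```

The stemmer can return strings such as "runn" or "fli" that are not words, while the lemmatizer only returns real words but depends on its vocabulary and on being told the right part of speech.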


#### 1 - Stemming

In [89]:
# Sample words used for both stemming and lemmatization
words = [
    "running",
    "flies",
    "easily",
    "studies",
    "happily",
    "better",
    "played",
    "playing",
    "children"
]


In [90]:
# Performing Stemming
stemmer = PorterStemmer()

# Stemming each word in the list
print("STEMMING:\n")
for w in words:
  print(w, ":", stemmer.stem(w))


STEMMING:

running : run
flies : fli
easily : easili
studies : studi
happily : happili
better : better
played : play
playing : play
children : children


#### 2 - Lemmatization

In [91]:
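# Performing Lemmatization
# WordNetLemmatizer needs the WordNet corpus, which is downloaded once here
nltk.download('wordnet', quiet=True)
lemmatizer = WordNetLemmatizer()

# lemmatize() treats every word as a noun unless a pos argument is given,
# which is why verbs such as "running" and "played" come back unchanged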
print("Lemmatization:\n")
for w in words:
  print(w, ":", lemmatizer.lemmatize(w))


Lemmatization:

running : running
flies : fly
easily : easily
studies : study
happily : happily
better : better
played : played
playing : playing
children : child
