**Why NLP is important in GenAI**


*   LLMs are built on NLP concepts

*   Prompt understanding depends on NLP

*   RAG uses NLP for retrieval and similarity search
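To make the retrieval bullet concrete, here is a minimal sketch of similarity search using word counts and cosine similarity. It is an illustration under simplifying assumptions: production RAG systems typically compare dense embeddings rather than raw word counts, and the function names and sample documents below are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents):
    """Return the document whose word counts are most similar to the query."""
    query_vector = bag_of_words(query)
    return max(
        documents,
        key=lambda doc: cosine_similarity(query_vector, bag_of_words(doc)),
    )


# Hypothetical mini knowledge base, for illustration only.
documents = [
    "Stemming cuts words down to a root form.",
    "A corpus is a large collection of text.",
    "POS tagging labels each word with its part of speech.",
]
best_match = retrieve("What is a corpus?", documents)
print(best_match)
```

The query shares the words "is", "a", and "corpus" with the second document, so that document scores highest.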

**Why do we use NLP?**


*   Computers do not naturally understand human language.

*   NLP enables machines to read and process text or speech.
*   It helps interpret the meaning and intent of human language.
*   NLP allows machines to generate meaningful and relevant responses.


**Basics of Natural Language Processing (NLP)**

Natural Language Processing (NLP) is a field of AI that enables machines to understand, process, and generate human language.

**Token:**

A token is the smallest unit of text that a model processes.
In AI, it can be a word, part of a word (a subword), or a symbol such as a punctuation mark.

**Corpus:**

A corpus is a large collection of text data used for language analysis or training AI models.
It can include books, articles, websites, or conversations and helps models learn language patterns.

We make tokens from a corpus by applying tokenization to the text.

**Tokenization** is the process of breaking text into small units called tokens.
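As a rough illustration of turning a corpus into tokens, the sketch below uses only a regular expression. The two-sentence corpus and the `simple_tokenize` helper are hypothetical; library tokenizers handle many more cases (contractions, abbreviations, URLs).

```python
import re
from collections import Counter

# Hypothetical two-document corpus, for illustration only.
corpus = [
    "NLP helps machines read text.",
    "Machines read text as tokens.",
]


def simple_tokenize(text):
    """Split text into word tokens and single punctuation-mark tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


# Flatten the corpus into one list of tokens.
tokens = [token for document in corpus for token in simple_tokenize(document)]

# Count how often each lowercased token appears across the corpus.
vocabulary = Counter(token.lower() for token in tokens)

print(tokens)
print(vocabulary["machines"])
```

Lowercasing before counting means "machines" and "Machines" are treated as the same vocabulary entry.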

**Normalization** means converting text into a standard form, for example by lowercasing it and removing extra symbols.

**Stemming** means cutting words to their root form.

**Lemmatization** means converting words to their meaningful base form using grammar.


**NLTK (Natural Language Toolkit)** is a Python library used for Natural Language Processing (NLP) tasks.

**Common NLP tasks using NLTK:**


*   Tokenization

*   Stopword removal

*   Stemming
*   Lemmatization

*   Part-of-Speech (POS) tagging


*   Named Entity Recognition (NER)



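`punkt` is a pre-trained sentence tokenizer model in NLTK. `sent_tokenize` uses it directly, and `word_tokenize` uses it to split text into sentences before splitting each sentence into words. Recent NLTK releases load it from the `punkt_tab` resource.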

**Stemming:**

Stemming is the process of reducing words to their root form by removing prefixes or suffixes.

*   It uses simple rules.
*   The root word may not be a valid dictionary word.
*   played → play
*   studies → studi
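The idea of "simple rules" can be sketched with a toy suffix stripper. This is not the Porter algorithm, which has many more rules and conditions; the rule list and the `toy_stem` name are made up for illustration.

```python
# Ordered suffix rules: (suffix, replacement). The first usable match wins.
SUFFIX_RULES = [
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def toy_stem(word, min_stem_length=3):
    """Strip one known suffix, keeping at least min_stem_length characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)] + replacement
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word  # no rule applied


for word in ["played", "studies", "playing", "plays", "bus"]:
    print(word, "->", toy_stem(word))
```

The minimum stem length keeps short words such as "bus" from being cut down to "bu", which hints at why real stemmers attach conditions to each rule.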

**Lemmatization:**

Lemmatization is the process of converting words to their actual base or dictionary form (lemma).

*   It considers meaning and grammar.
*   The output is a valid dictionary word.
*   playing → play
*   better → good (when treated as an adjective)
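Unlike stemming, lemmatization is a dictionary lookup that depends on the part of speech. The sketch below imitates that with a tiny hand-written lexicon; the `LEMMA_LEXICON` table and `toy_lemmatize` function are hypothetical stand-ins for a full dictionary such as WordNet.

```python
# Hypothetical mini lexicon mapping (word, part of speech) to a lemma.
# Real lemmatizers consult a full dictionary instead of a hand-written table.
LEMMA_LEXICON = {
    ("playing", "verb"): "play",
    ("played", "verb"): "play",
    ("better", "adjective"): "good",
    ("better", "adverb"): "well",
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_LEXICON.get(key, key[0])


print(toy_lemmatize("playing", "verb"))
print(toy_lemmatize("better", "adjective"))
print(toy_lemmatize("better", "adverb"))
print(toy_lemmatize("platform", "noun"))
```

The same surface word ("better") maps to different lemmas depending on its part of speech, and an unknown word is returned unchanged.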


**POS tagging** is the process of assigning a grammatical label (part of speech) to each word in a sentence.

**Common POS tags:**

*   NN – Noun
*   VB – Verb
*   JJ – Adjective
*   RB – Adverb
*   PRP – Personal pronoun
*   IN – Preposition or subordinating conjunction
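The Penn Treebank tag set extends these base tags with variants such as NNP (proper noun), VBZ (third-person singular verb), and JJR (comparative adjective). A small helper can collapse such variants into the coarse categories listed above; the `COARSE_TAGS` table and `coarse_category` name are assumptions made for this sketch, and the table covers only the tags in this list.

```python
# Coarse categories keyed by Penn Treebank tag prefix.
# Only the tags listed above are covered; this is not the full tag set.
COARSE_TAGS = [
    ("PRP", "Pronoun"),
    ("NN", "Noun"),
    ("VB", "Verb"),
    ("JJ", "Adjective"),
    ("RB", "Adverb"),
    ("IN", "Preposition"),
]


def coarse_category(tag):
    """Map a fine-grained tag such as 'NNP' or 'VBZ' to a coarse category."""
    for prefix, name in COARSE_TAGS:
        if tag.startswith(prefix):
            return name
    return "Other"


for tag in ["NN", "NNP", "VBZ", "JJR", "PRP$", "DT"]:
    print(tag, "->", coarse_category(tag))
```

Tags outside the table, such as DT (determiner), fall through to "Other".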

In [None]:
#installing nltk
! pip install nltk



In [None]:
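import nltk

# Resources used in this notebook: the Punkt sentence tokenizer and WordNet.
nltk.download('punkt_tab')
nltk.download('wordnet')
# 'all' fetches every NLTK corpus and model. It is large and makes the two
# calls above redundant, but it ensures later cells find their data.
nltk.download('all')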

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nlt

True

In [None]:
#tokenizing using punkt_tab
from nltk.tokenize import sent_tokenize, word_tokenize

# No space follows "platform.", so sent_tokenize returns the text as a single sentence.
text="GeeksforGeeks is a great learning platform.It is one of the best for Computer Science students."
print(sent_tokenize(text))
print(word_tokenize(text))

['GeeksforGeeks is a great learning platform.It is one of the best for Computer Science students.']


In [None]:
#Normalization
nltk.download('stopwords')  # stopword lists, for stopword removal
from nltk.corpus import stopwords  # imported for later use; not used in this cell

text = "The quick brown fox jumps over the lazy dog."
# Convert to lowercase
normalized_text = text.lower()
print(normalized_text)


the quick brown fox jumps over the lazy dog.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
#removing numbers
import re

text = "GATE 2025 exam is on 15 Feb"
cleaned = re.sub(r'\d+', '', text)
print(cleaned)


GATE  exam is on  Feb


In [None]:
#removing punctuation and special characters
import re

text = "Learning NLP!!! is fun :)"
cleaned = re.sub(r'[^a-zA-Z\s]', '', text)
print(cleaned)


Learning NLP is fun 


In [None]:
#lowercasing and removing punctuation together
import re

text = "Students are Learning NLP Techniques!!!"
text = text.lower()
cleaned = re.sub(r'[^a-z\s]', '', text)
print(cleaned)


students are learning nlp techniques
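Stopword removal appears in the task list but is not shown in the cells above. A minimal sketch follows, assuming a small hand-picked stopword set so that it runs without downloaded data; `SAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical names, and NLTK's own list from `stopwords.words("english")` could be passed in instead.

```python
import re

# Small hand-picked stopword set, for illustration only. With NLTK data
# available, set(stopwords.words("english")) could be passed in instead.
SAMPLE_STOPWORDS = {"the", "over", "a", "an", "is", "are", "of", "on"}


def remove_stopwords(text, stop_words):
    """Lowercase, keep alphabetic tokens, and drop any token in stop_words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


filtered = remove_stopwords(
    "The quick brown fox jumps over the lazy dog.", SAMPLE_STOPWORDS
)
print(filtered)
```

Passing the stopword set as an argument keeps the function independent of any one list, which matters because the right stopwords differ by task and language.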


In [None]:
#stemming
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "ran"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

['run', 'run', 'ran']
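The output above shows that the stemmer leaves the irregular form "ran" unchanged. A related limitation is that unrelated words can collapse to the same stem. The sketch below groups a few sample words by their Porter stem to make this visible; the word list is chosen for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Group words by their stem to see which distinct words collapse together.
words = ["university", "universe", "studies", "studying", "played"]
groups = {}
for word in words:
    groups.setdefault(stemmer.stem(word), []).append(word)

for stem, members in groups.items():
    print(stem, "<-", members)
```

Here "university" and "universe" end up under one stem even though their meanings differ, which is one reason lemmatization is preferred when the exact word matters.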


In [None]:
#lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))



play
play
play
play


In [None]:
#lemmatization without a POS tag (the default POS is noun)
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("plays"))
print(lemmatizer.lemmatize("played"))
print(lemmatizer.lemmatize("play"))
print(lemmatizer.lemmatize("playing"))


play
played
play
playing


In [None]:
#pos tagging
from nltk import pos_tag, word_tokenize

text = "GeeksforGeeks is a Computer Science platform."
tokenized_text = word_tokenize(text)
tags = pos_tag(tokenized_text)
tags

[('GeeksforGeeks', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('Computer', 'NNP'),
 ('Science', 'NNP'),
 ('platform', 'NN'),
 ('.', '.')]
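POS tags and lemmatization fit together: the lemmatizer gives better results when it is told each word's part of speech. The sketch below converts the Treebank tags from the output above into the single-letter codes that WordNet uses ('n', 'v', 'a', 'r'). The `to_wordnet_pos` helper is a hypothetical name, and the tagged list is copied from that output so the example runs without any downloaded data.

```python
# Tagged tokens copied from the pos_tag output above.
tagged = [
    ("GeeksforGeeks", "NNP"),
    ("is", "VBZ"),
    ("a", "DT"),
    ("Computer", "NNP"),
    ("Science", "NNP"),
    ("platform", "NN"),
    (".", "."),
]


def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the single-letter POS code WordNet uses."""
    if treebank_tag.startswith("JJ"):
        return "a"  # adjective
    if treebank_tag.startswith("VB"):
        return "v"  # verb
    if treebank_tag.startswith("RB"):
        return "r"  # adverb
    return "n"  # noun, which is also the lemmatizer's default


# Each (word, pos) pair can then be passed to lemmatizer.lemmatize(word, pos).
pairs = [(word, to_wordnet_pos(tag)) for word, tag in tagged]
print(pairs)
```

With this mapping, "is" would be lemmatized as a verb rather than as a noun, which is the same effect as passing 'v' by hand in the earlier lemmatization cell.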