## NLP (Natural Language Processing)

The primary goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and contextually relevant.

In [1]:
#importing the libraries for string tokenization

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize


In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
text = "I started the 21 days challenge. This is the first day of learning NLP."

In [4]:
#Tokenize into words
words = word_tokenize(text)
print("Tokenized words:", words)

Tokenized words: ['I', 'started', 'the', '21', 'days', 'challenge', '.', 'This', 'is', 'the', 'first', 'day', 'of', 'learning', 'NLP', '.']


In [5]:
#Tokenize into sentences
sentences = sent_tokenize(text)
print("Tokenized sentences:", sentences)

Tokenized sentences: ['I started the 21 days challenge.', 'This is the first day of learning NLP.']


The Natural Language Toolkit (NLTK) is a Python library for working with human language data; it bundles tokenizers, stemmers, lemmatizers, and other NLP tools.
Tokenization is the process of breaking text into smaller units, such as words or sentences.
'Punkt' is a pre-trained sentence tokenizer model that NLTK uses to find sentence boundaries. It contains information about abbreviations and sentence starters. word_tokenize relies on it as well, because it first splits the text into sentences and then splits each sentence into words.
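
To make the idea of tokenization concrete, here is a minimal sketch that uses only regular expressions. It is not NLTK's algorithm: the function names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration, and the sentence splitter knows nothing about abbreviations, which is exactly the gap a trained model such as Punkt is meant to close.

```python
import re

def simple_word_tokenize(text):
    # A token is either a run of word characters or a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that follows '.', '!' or '?'.
    # Abbreviations such as "Dr." are NOT handled (illustrative limitation).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "I started the 21 days challenge. This is the first day of learning NLP."
print(simple_word_tokenize(sample))
print(simple_sent_tokenize(sample))

# The naive rule breaks on abbreviations and yields ['Dr.', 'Smith arrived.']
print(simple_sent_tokenize("Dr. Smith arrived."))
```

On the sample sentence the sketch gives the same tokens as the NLTK output above, but the last call shows how a rule that only looks at the full stop splits "Dr." off as its own sentence.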

In [6]:
#importing Libraries for frequency distribution
from collections import Counter

In [7]:
text = "Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment.NLP drives computer programs that translate text from one language to another, respond to spoken commands, and summarize large volumes of text rapidly—even in real time. "

In [8]:
words = text.split()
print(words)

['Natural', 'language', 'processing', '(NLP)', 'refers', 'to', 'the', 'branch', 'of', 'computer', 'science—and', 'more', 'specifically,', 'the', 'branch', 'of', 'artificial', 'intelligence', 'or', 'AI—concerned', 'with', 'giving', 'computers', 'the', 'ability', 'to', 'understand', 'text', 'and', 'spoken', 'words', 'in', 'much', 'the', 'same', 'way', 'human', 'beings', 'can.', 'NLP', 'combines', 'computational', 'linguistics—rule-based', 'modeling', 'of', 'human', 'language—with', 'statistical,', 'machine', 'learning,', 'and', 'deep', 'learning', 'models.', 'Together,', 'these', 'technologies', 'enable', 'computers', 'to', 'process', 'human', 'language', 'in', 'the', 'form', 'of', 'text', 'or', 'voice', 'data', 'and', 'to', '‘understand’', 'its', 'full', 'meaning,', 'complete', 'with', 'the', 'speaker', 'or', 'writer’s', 'intent', 'and', 'sentiment.NLP', 'drives', 'computer', 'programs', 'that', 'translate', 'text', 'from', 'one', 'language', 'to', 'another,', 'respond', 'to', 'spoken',

In [9]:
#calculating the word frequency
word_frequency = Counter(words)
print('word_frequency:', word_frequency)

word_frequency: Counter({'to': 6, 'the': 6, 'of': 5, 'and': 5, 'text': 4, 'language': 3, 'or': 3, 'in': 3, 'human': 3, 'branch': 2, 'computer': 2, 'with': 2, 'computers': 2, 'spoken': 2, 'Natural': 1, 'processing': 1, '(NLP)': 1, 'refers': 1, 'science—and': 1, 'more': 1, 'specifically,': 1, 'artificial': 1, 'intelligence': 1, 'AI—concerned': 1, 'giving': 1, 'ability': 1, 'understand': 1, 'words': 1, 'much': 1, 'same': 1, 'way': 1, 'beings': 1, 'can.': 1, 'NLP': 1, 'combines': 1, 'computational': 1, 'linguistics—rule-based': 1, 'modeling': 1, 'language—with': 1, 'statistical,': 1, 'machine': 1, 'learning,': 1, 'deep': 1, 'learning': 1, 'models.': 1, 'Together,': 1, 'these': 1, 'technologies': 1, 'enable': 1, 'process': 1, 'form': 1, 'voice': 1, 'data': 1, '‘understand’': 1, 'its': 1, 'full': 1, 'meaning,': 1, 'complete': 1, 'speaker': 1, 'writer’s': 1, 'intent': 1, 'sentiment.NLP': 1, 'drives': 1, 'programs': 1, 'that': 1, 'translate': 1, 'from': 1, 'one': 1, 'another,': 1, 'respond': 1,

In [10]:
#Displaying the frequency distribution
for word, frequency in word_frequency.items():
  print(f"{word}\t\t{frequency}")

Natural		1
language		3
processing		1
(NLP)		1
refers		1
to		6
the		6
branch		2
of		5
computer		2
science—and		1
more		1
specifically,		1
artificial		1
intelligence		1
or		3
AI—concerned		1
with		2
giving		1
computers		2
ability		1
understand		1
text		4
and		5
spoken		2
words		1
in		3
much		1
same		1
way		1
human		3
beings		1
can.		1
NLP		1
combines		1
computational		1
linguistics—rule-based		1
modeling		1
language—with		1
statistical,		1
machine		1
learning,		1
deep		1
learning		1
models.		1
Together,		1
these		1
technologies		1
enable		1
process		1
form		1
voice		1
data		1
‘understand’		1
its		1
full		1
meaning,		1
complete		1
speaker		1
writer’s		1
intent		1
sentiment.NLP		1
drives		1
programs		1
that		1
translate		1
from		1
one		1
another,		1
respond		1
commands,		1
summarize		1
large		1
volumes		1
rapidly—even		1
real		1
time.		1
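
Because `text.split()` only splits on whitespace, entries such as `can.`, `(NLP)` and `NLP` are counted as different words. One possible refinement, sketched here under the assumption that case and surrounding punctuation should be ignored, is to normalize the tokens before counting. The sample string and the regular expression are illustrative choices, not part of the notebook.

```python
import re
from collections import Counter

# Hypothetical sample text for the illustration.
sample = "NLP combines rule-based modeling with machine learning. (NLP) drives programs; nlp is everywhere."

# Lowercase the text, then keep alphanumeric runs, allowing internal hyphens ("rule-based").
tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sample.lower())

freq = Counter(tokens)

# most_common(n) returns the n highest counts as (word, count) pairs.
for word, count in freq.most_common(3):
    print(f"{word}\t{count}")
```

With this normalization all three spellings of "NLP" fall into a single entry, so the counts reflect words rather than surface strings.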


In [11]:
#counting the total words
word_count = len(words)
print("Total word count:", word_count)

Total word count: 111


In [12]:
#counting the unique words
unique_word = set(words)
unique_word_count = len(unique_word)
print("Total unique word count:", unique_word_count)

Total unique word count: 77


In [13]:
# counting the sentences
sentences = sent_tokenize(text)
count_sent = len(sentences)
print("Total sentence count:", count_sent)

Total sentence count: 3


In [14]:
# longest sentence (by number of characters)
longest_sentence = max(sentences, key=len)
print("Longest sentence:", longest_sentence)

Longest sentence: Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment.NLP drives computer programs that translate text from one language to another, respond to spoken commands, and summarize large volumes of text rapidly—even in real time.


In [15]:
# removing ASCII punctuation from sentences
import string

def remove_punctuation(sentence):
  # string.punctuation covers ASCII punctuation only, so non-ASCII marks
  # such as the em dash and curly quotes are kept
  translator = str.maketrans("", "", string.punctuation)
  return sentence.translate(translator)

cleaned_sentences = [remove_punctuation(sentence) for sentence in sentences]
print("Sentences without punctuation:", cleaned_sentences)

Sentences without punctuation: ['Natural language processing NLP refers to the branch of computer science—and more specifically the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can', 'NLP combines computational linguistics—rulebased modeling of human language—with statistical machine learning and deep learning models', 'Together these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning complete with the speaker or writer’s intent and sentimentNLP drives computer programs that translate text from one language to another respond to spoken commands and summarize large volumes of text rapidly—even in real time']
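
The output above still contains the em dash and the curly quotes, because `string.punctuation` lists ASCII characters only. A possible alternative, sketched here with the standard `unicodedata` module, is to drop every character whose Unicode general category starts with "P" (punctuation). The function name `remove_all_punctuation` and the sample string are assumptions made for this illustration.

```python
import unicodedata

def remove_all_punctuation(text):
    # Keep a character only if its Unicode category is not a punctuation category
    # (Pd = dash, Pi/Pf = opening/closing quote, Po = other, and so on).
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

sample = "science—and ‘understand’ the writer’s intent, rule-based."
print(remove_all_punctuation(sample))
```

Two caveats apply. Deleting a dash glues the neighbouring words together ("science" and "and" become one token), so replacing dashes with a space first may be preferable. Also, some members of `string.punctuation`, such as `$` and `+`, belong to Unicode symbol categories ("S") rather than punctuation, so this function leaves them in place.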


In [16]:
#checking all punctuation characters defined in the string module
all_punctuation = string.punctuation

num_different_punctuation = len(all_punctuation)

print("Number of different punctuation characters:", num_different_punctuation, "all punctuation" , all_punctuation)


Number of different punctuation characters: 32 all punctuation !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [17]:
#Annotator Creation

data = [
    ("i love it", "positive"),
    ("i hate it", "negative"),
    ("this is not what i expected", "negative"),
    ("the quality of this item is poor", "negative")
]

In [18]:
# define the annotator

def annotate_data(data):
  # Collect (text, label) pairs; the list has its own name so it does not shadow the function
  annotated = []
  for text, label in data:
    annotated.append((text, label))
  return annotated


In [19]:
#annotate the data
annotated_data = annotate_data(data)
for text, label in annotated_data:
  print(f'Text: {text} \tLabel: {label}')

Text: i love it 	Label: positive
Text: i hate it 	Label: negative
Text: this is not what i expected 	Label: negative
Text: the quality of this item is poor 	Label: negative


Data annotation is the process of labeling or tagging data with relevant information, typically to prepare high-quality training data for AI models.
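
Since label quality matters, a rough sanity check can help spot annotations that look suspicious. The sketch below is only an illustration: the cue word sets, the sample data, and the function names `suggest_label` and `flag_disagreements` are all hypothetical, and a keyword rule like this is far too crude to replace a human annotator.

```python
# Hypothetical cue words, chosen only for this illustration.
POSITIVE_CUES = {"love", "great", "excellent"}
NEGATIVE_CUES = {"hate", "poor", "not"}

def suggest_label(text):
    # Count cue words of each polarity and suggest the majority label.
    tokens = set(text.lower().split())
    pos_hits = len(tokens & POSITIVE_CUES)
    neg_hits = len(tokens & NEGATIVE_CUES)
    if pos_hits > neg_hits:
        return "positive"
    if neg_hits > pos_hits:
        return "negative"
    return "unsure"

def flag_disagreements(data):
    # Return (text, given_label, suggested_label) where the rule contradicts the label.
    flagged = []
    for text, label in data:
        suggestion = suggest_label(text)
        if suggestion != "unsure" and suggestion != label:
            flagged.append((text, label, suggestion))
    return flagged

# Hypothetical sample with one questionable label.
sample = [
    ("i love it", "positive"),
    ("i hate it", "negative"),
    ("the battery life is poor", "positive"),
]
for text, label, suggestion in flag_disagreements(sample):
    print(f"Check: {text!r} is labeled {label}, rule suggests {suggestion}")
```

Flagged items are candidates for a second look, not automatic corrections.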

In [20]:
#Lemmatization in Text Processing
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [21]:
#initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

In [22]:
lemmatized_word = [lemmatizer.lemmatize(word) for word in words]
print("lemmatized words:", lemmatized_word)

lemmatized words: ['Natural', 'language', 'processing', '(NLP)', 'refers', 'to', 'the', 'branch', 'of', 'computer', 'science—and', 'more', 'specifically,', 'the', 'branch', 'of', 'artificial', 'intelligence', 'or', 'AI—concerned', 'with', 'giving', 'computer', 'the', 'ability', 'to', 'understand', 'text', 'and', 'spoken', 'word', 'in', 'much', 'the', 'same', 'way', 'human', 'being', 'can.', 'NLP', 'combine', 'computational', 'linguistics—rule-based', 'modeling', 'of', 'human', 'language—with', 'statistical,', 'machine', 'learning,', 'and', 'deep', 'learning', 'models.', 'Together,', 'these', 'technology', 'enable', 'computer', 'to', 'process', 'human', 'language', 'in', 'the', 'form', 'of', 'text', 'or', 'voice', 'data', 'and', 'to', '‘understand’', 'it', 'full', 'meaning,', 'complete', 'with', 'the', 'speaker', 'or', 'writer’s', 'intent', 'and', 'sentiment.NLP', 'drive', 'computer', 'program', 'that', 'translate', 'text', 'from', 'one', 'language', 'to', 'another,', 'respond', 'to', '

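Lemmatization is a common step in text processing. It reduces words to their base (dictionary) form, called the lemma.
The WordNet Lemmatizer is a tool provided by the Natural Language Toolkit (NLTK) for lemmatizing words based on the WordNet lexical database. By default, lemmatize treats every word as a noun, which is why plural nouns such as 'computers' and 'technologies' are reduced in the output above while verb forms such as 'refers' are left unchanged.
WordNet is a lexical database of the English language that includes information about words, their meanings, and the relationships between them.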
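
To show why the part of speech matters for dictionary-based lemmatization, here is a toy sketch that uses no NLTK data at all. It is not the WordNet algorithm or its data: `VOCAB`, `EXCEPTIONS`, `SUFFIX_RULES` and `toy_lemmatize` are hypothetical stand-ins that only mimic the general idea of looking up irregular forms and then trying suffix rules against a known vocabulary.

```python
# Tiny stand-in for a lexical database (hypothetical, not WordNet data).
VOCAB = {
    "n": {"computer", "word", "technology", "program"},
    "v": {"refer", "drive", "combine", "give"},
}

# Irregular forms that no suffix rule could recover.
EXCEPTIONS = {
    "n": {"children": "child"},
    "v": {"spoken": "speak", "gave": "give"},
}

# (suffix, replacement) pairs, tried in order for each part of speech.
SUFFIX_RULES = {
    "n": [("ies", "y"), ("es", "e"), ("es", ""), ("s", "")],
    "v": [("ies", "y"), ("es", "e"), ("es", ""), ("s", ""),
          ("ing", "e"), ("ing", ""), ("ed", "e"), ("ed", "")],
}

def toy_lemmatize(word, pos="n"):
    lower = word.lower()
    # 1. Irregular forms are resolved by direct lookup.
    if lower in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][lower]
    # 2. A word that is already a known base form is returned as is.
    if lower in VOCAB.get(pos, set()):
        return lower
    # 3. Try each suffix rule and accept the first candidate found in the vocabulary.
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if lower.endswith(suffix):
            candidate = lower[: -len(suffix)] + replacement
            if candidate in VOCAB.get(pos, set()):
                return candidate
    # 4. Nothing matched: leave the word unchanged.
    return word

print(toy_lemmatize("computers"))        # treated as a noun
print(toy_lemmatize("refers"))           # no noun match, stays unchanged
print(toy_lemmatize("refers", pos="v"))  # treated as a verb
print(toy_lemmatize("spoken", pos="v"))  # irregular form via the exception table
```

The same surface form gives different results depending on `pos`, which mirrors the behaviour seen with the default noun setting in the notebook output.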
