Every Beginner NLP Engineer must know these Techniques

https://ankushmulkar.medium.com/every-beginner-nlp-engineer-must-know-these-techniques-678605dc6026

In [3]:
!pip install nltk



Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements, known as tokens.

In [5]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

text = "This is an example of tokenization."
tokens = word_tokenize(text)
print(tokens)
# Output: ['This', 'is', 'an', 'example', 'of', 'tokenization', '.']

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['This', 'is', 'an', 'example', 'of', 'tokenization', '.']
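To make the idea concrete without NLTK, here is a minimal sketch of a regex-based tokenizer. The function name `simple_tokenize` and the pattern are assumptions chosen for illustration; this is not how `word_tokenize` is implemented, only a rough approximation of its behaviour on simple sentences.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches any single character that is neither word nor space,
    # so each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("This is an example of tokenization.")
print(tokens)
# Output: ['This', 'is', 'an', 'example', 'of', 'tokenization', '.']
```

The sketch agrees with word_tokenize on this sentence, but the two diverge on contractions: the regex splits "It's" into three tokens, whereas word_tokenize keeps "'s" together.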


Lemmatization is the process of reducing a word to its dictionary form, called a lemma. Stemming is a similar but cruder process that strips suffixes by rule, so it often produces strings that are not actual words.

In [7]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

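# lemmatize() treats the word as a noun unless a part of speech is given,
# so verb forms come back unchanged by default.
print(lemmatizer.lemmatize("running"))
# Output: 'running' (lemmatizer.lemmatize("running", pos="v") returns 'run')
print(lemmatizer.lemmatize("ran"))
# Output: 'ran' (lemmatizer.lemmatize("ran", pos="v") returns 'run')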

[nltk_data] Downloading package wordnet to /root/nltk_data...


running
ran


In Natural Language Processing (NLP), stemming reduces a word to a base form (its stem) by stripping suffixes according to fixed rules; the stem need not be a dictionary word. This is often done to group together different forms of a word so they can be analyzed as a single item.

In [8]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('running'))
# Output: 'run'
print(stemmer.stem('runner'))
# Output: 'runner'

run
runner
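The rule-based idea can be sketched in a few lines. The function `toy_stem` and its three suffix rules below are hypothetical and heavily simplified; they are not the Porter algorithm, which applies a much longer sequence of conditional rules.

```python
def toy_stem(word):
    # Try each suffix in turn; strip the first one that leaves a stem
    # of at least three characters.
    for suffix in ("ing", "ed", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling left behind by the suffix ("runn" -> "run").
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    # No rule applied: return the word unchanged.
    return word

for w in ("running", "jumped", "cats", "runner"):
    print(w, "->", toy_stem(w))
```

As with the PorterStemmer output above, "running" becomes "run" while "runner" is left alone, because none of the toy rules covers the "er" suffix.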


Part-of-speech (POS) tagging is the process of marking each word in a text with its corresponding POS tag.

In [10]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')
text = "I am learning NLP techniques in Python."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
print(pos_tags)
# Output: [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('techniques', 'NNS'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('techniques', 'NNS'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]
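Once the text is tagged, the tags can be used as a filter. The sketch below reuses the tagged list printed above (copied in by hand so that it runs without NLTK) and picks out nouns and verbs by tag prefix; the variable names are illustrative.

```python
# Tagged tokens copied from the pos_tag output above.
pos_tags = [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'),
            ('techniques', 'NNS'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]

# Penn Treebank noun tags start with "NN" (NN, NNS, NNP, NNPS)
# and verb tags start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
verbs = [word for word, tag in pos_tags if tag.startswith("VB")]

print(nouns)
# Output: ['NLP', 'techniques', 'Python']
print(verbs)
# Output: ['am', 'learning']
```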


Named Entity Recognition (NER) is the process of identifying and classifying named entities in a text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

In [13]:
from nltk import ne_chunk
from nltk.tokenize import word_tokenize
nltk.download('maxent_ne_chunker')
nltk.download('words')

text = "Barack Obama was born in Hawaii."
tokens = word_tokenize(text)
tagged_tokens = nltk.pos_tag(tokens)
ner_tree = ne_chunk(tagged_tokens)
print(ner_tree)
# Output: (S (PERSON Barack/NNP) (PERSON Obama/NNP) was/VBD born/VBN in/IN (GPE Hawaii/NNP) ./.)

(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Hawaii/NNP)
  ./.)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
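In the tree above, "Barack" and "Obama" come out as two separate PERSON chunks. One possible post-processing step, sketched here under the assumption that adjacent chunks with the same label belong to one entity, is to merge them. The (token, label) list is transcribed by hand from the tree, and `merge_adjacent` is a hypothetical helper, not part of NLTK.

```python
# (token, label) pairs transcribed from the tree above; None means "not an entity".
chunks = [("Barack", "PERSON"), ("Obama", "PERSON"), ("was", None),
          ("born", None), ("in", None), ("Hawaii", "GPE"), (".", None)]

def merge_adjacent(chunks):
    entities = []
    prev_label = None
    for token, label in chunks:
        if label is not None and label == prev_label:
            # Same label as the previous token: extend the last entity.
            text, _ = entities[-1]
            entities[-1] = (text + " " + token, label)
        elif label is not None:
            # A new entity starts here.
            entities.append((token, label))
        prev_label = label
    return entities

entities = merge_adjacent(chunks)
print(entities)
# Output: [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE')]
```

The assumption has a cost: two genuinely different people named back to back would also be merged into one entity.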


Sentiment Analysis is the process of determining the emotional tone behind a piece of text, whether it is positive, negative, or neutral.

In [15]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

text = "I love this product! It's amazing."
sia = SentimentIntensityAnalyzer()
score = sia.polarity_scores(text)
print(score)
# Output: {'neg': 0.0, 'neu': 0.266, 'pos': 0.734, 'compound': 0.8516}

{'neg': 0.0, 'neu': 0.266, 'pos': 0.734, 'compound': 0.8516}


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
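SentimentIntensityAnalyzer is lexicon-based: it looks up words in a list of rated terms and combines the ratings. The sketch below shows only that core idea. The four-word lexicon, its scores, and the squashing function are made up for illustration; the real analyzer uses a far larger lexicon plus rules for negation, punctuation, and intensifiers.

```python
import math
import re

# Hypothetical mini-lexicon: positive words get positive scores and vice versa.
LEXICON = {"love": 3.0, "amazing": 2.5, "hate": -3.0, "terrible": -2.5}

def toy_sentiment(text, alpha=15.0):
    # Lowercase the text and pull out words (apostrophes kept, so "it's" is one word).
    words = re.findall(r"[a-z']+", text.lower())
    # Words missing from the lexicon contribute nothing.
    total = sum(LEXICON.get(w, 0.0) for w in words)
    # Squash the unbounded sum into the open interval (-1, 1).
    return total / math.sqrt(total * total + alpha)

print(toy_sentiment("I love this product! It's amazing."))
print(toy_sentiment("I hate this terrible product."))
print(toy_sentiment("This is a text."))
```

The first call returns a value between 0 and 1, the second a value between -1 and 0, and the third exactly 0.0 because no word is in the lexicon.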


Text Classification is the process of assigning predefined categories or tags to a piece of text.

In [16]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Create a dataset
data = {'text': ['This is a positive text.', 'This is a negative text.'], 'label': ['positive', 'negative']}
df = pd.DataFrame(data)
# Create a CountVectorizer object
vectorizer = CountVectorizer()
# Transform the text column
X = vectorizer.fit_transform(df['text'])
# Create a MultinomialNB object
clf = MultinomialNB()
# Fit the model
clf.fit(X, df['label'])
# Test the model
text = "This is a neutral text."
X_test = vectorizer.transform([text])
pred = clf.predict(X_test)
print(pred)
# Output: ['negative'] (both classes score equally on this input; the tie goes to the first class alphabetically)

['negative']
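To see why the classifier has no real basis for its answer, here is a hand-rolled multinomial Naive Bayes on the same two training sentences. It is a sketch that assumes add-one smoothing and a tokenizer that lowercases and drops one-character tokens, which mirrors the scikit-learn defaults as far as I know; the helper names are illustrative.

```python
import math
import re
from collections import Counter

train = [("This is a positive text.", "positive"),
         ("This is a negative text.", "negative")]

def tokenize(text):
    # Lowercased tokens of two or more word characters, so "a" is dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# Per-class word counts, per-class document counts, and the shared vocabulary.
word_counts = {}
for text, label in train:
    word_counts.setdefault(label, Counter()).update(tokenize(text))
doc_counts = Counter(label for _, label in train)
vocab = sorted({tok for counter in word_counts.values() for tok in counter})

def log_score(text, label, alpha=1.0):
    # log P(label) + sum of log P(word | label), with add-alpha smoothing.
    total = sum(word_counts[label].values()) + alpha * len(vocab)
    score = math.log(doc_counts[label] / len(train))
    for tok in tokenize(text):
        if tok in vocab:  # unseen words such as "neutral" are ignored
            score += math.log((word_counts[label][tok] + alpha) / total)
    return score

scores = {label: log_score("This is a neutral text.", label)
          for label in sorted(word_counts)}
print(vocab)
print(scores)
```

The only words the test sentence shares with the training data ("this", "is", "text") occur equally often in both classes, so the two log scores are identical. The printed prediction of 'negative' is therefore a tie-break, not evidence about the sentence.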


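Language Translation is the process of converting text from one language to another. The example below uses googletrans, an unofficial client for Google Translate, so it needs the package installed and a network connection.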

In [None]:
from googletrans import Translator
translator = Translator()
text = "I am learning NLP techniques in Python."
translated_text = translator.translate(text, dest='fr').text
print(translated_text)
# Example output (exact wording may vary): "J'apprends des techniques NLP en Python."