<a href="https://colab.research.google.com/github/marimuthuc/nlp-learning/blob/main/rp_nltk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basics of NLP with the NLTK Library
This notebook explores basic NLP concepts such as tokenizing, stop word removal, stemming, and lemmatizing with the help of the NLTK library.
> Reference: https://realpython.com/nltk-nlp-python/
## Tokenizing
> Splitting text into smaller units, either by `word` or by `sentence`.

In [1]:
# Import Libraries
import nltk
nltk.download('punkt')  # tokenizer models; newer NLTK releases may need 'punkt_tab' instead
from nltk.tokenize import sent_tokenize, word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
# Word Tokenizing
word_text = "Words are like the atoms of natural language. They’re the smallest unit of meaning that still makes sense on its own. Tokenizing your text by word allows you to identify words that come up particularly often. For example, if you were analyzing a group of job ads, then you might find that the word “Python” comes up often. That could suggest high demand for Python knowledge, but you’d need to look deeper to know more."
words = word_tokenize(word_text)
print(words)
len(words)

['Words', 'are', 'like', 'the', 'atoms', 'of', 'natural', 'language', '.', 'They', '’', 're', 'the', 'smallest', 'unit', 'of', 'meaning', 'that', 'still', 'makes', 'sense', 'on', 'its', 'own', '.', 'Tokenizing', 'your', 'text', 'by', 'word', 'allows', 'you', 'to', 'identify', 'words', 'that', 'come', 'up', 'particularly', 'often', '.', 'For', 'example', ',', 'if', 'you', 'were', 'analyzing', 'a', 'group', 'of', 'job', 'ads', ',', 'then', 'you', 'might', 'find', 'that', 'the', 'word', '“', 'Python', '”', 'comes', 'up', 'often', '.', 'That', 'could', 'suggest', 'high', 'demand', 'for', 'Python', 'knowledge', ',', 'but', 'you', '’', 'd', 'need', 'to', 'look', 'deeper', 'to', 'know', 'more', '.']


89
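
As the sample text notes, tokenizing by word makes it possible to see which words come up particularly often. A minimal sketch of such a count, assuming a short hand-written token list (`tokens` is hypothetical and abbreviated from the output above) and Python's standard `collections.Counter` rather than any NLTK helper:

```python
from collections import Counter

# Hypothetical, abbreviated token list in the style of the word_tokenize output
tokens = ["Words", "are", "like", "the", "atoms", "of", "natural", "language", ".",
          "Tokenizing", "your", "text", "by", "word", "allows", "you", "to",
          "identify", "words", "that", "come", "up", "often", "."]

# Case-fold so "Words" and "words" are counted together,
# and keep only purely alphabetic tokens to skip punctuation
word_counts = Counter(t.casefold() for t in tokens if t.isalpha())

print(word_counts.most_common(3))
```

On a real corpus the most frequent entries are usually function words such as "the" and "of", which is one reason a stop word filter is often applied before counting.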

In [5]:
# Sentence Tokenizing
sentence_text = "When you tokenize by sentence, you can analyze how those words relate to one another and see more context. Are there a lot of negative words around the word “Python” because the hiring manager doesn’t like Python? Are there more terms from the domain of herpetology than the domain of software development, suggesting that you may be dealing with an entirely different kind of python than you were expecting?"
sentences = sent_tokenize(sentence_text)
print(sentences)
len(sentences)

['When you tokenize by sentence, you can analyze how those words relate to one another and see more context.', 'Are there a lot of negative words around the word “Python” because the hiring manager doesn’t like Python?', 'Are there more terms from the domain of herpetology than the domain of software development, suggesting that you may be dealing with an entirely different kind of python than you were expecting?']


3
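
Sentence-level tokens make it possible to look at the context in which a word appears. The sketch below, assuming the three sentences printed above and a hypothetical search term `keyword`, lists the sentences that mention it. It uses a plain substring check, which is only an illustration and would also match longer words that contain the keyword.

```python
# Sentences as printed by sent_tokenize in the cell above
sentences = [
    "When you tokenize by sentence, you can analyze how those words relate to one another and see more context.",
    "Are there a lot of negative words around the word “Python” because the hiring manager doesn’t like Python?",
    "Are there more terms from the domain of herpetology than the domain of software development, suggesting that you may be dealing with an entirely different kind of python than you were expecting?",
]

keyword = "python"  # hypothetical search term

# Indices of sentences containing the keyword, ignoring case
mentions = [i for i, s in enumerate(sentences) if keyword in s.casefold()]

for i in mentions:
    print(i, sentences[i])
```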

## Stop Words Removal
> Removing common English words that are insignificant for analysis

In [3]:
# Downloading and importing stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

text = "Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves."
words = word_tokenize(text)

stop_words = set(stopwords.words("english"))
significant_words = [word for word in words if word.casefold() not in stop_words]
print(significant_words)
len(significant_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
['Stop', 'words', 'words', 'want', 'ignore', ',', 'filter', 'text', '’', 'processing', '.', 'common', 'words', 'like', "'in", "'", ',', "'is", "'", ',', "'an", "'", 'often', 'used', 'stop', 'words', 'since', '’', 'add', 'lot', 'meaning', 'text', '.']


33
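
The filtered list above still contains punctuation tokens such as `','` and `'’'`, because they are not part of the stop word list. One possible refinement, sketched here under assumptions, also drops non-alphabetic tokens with `str.isalpha()`. The `tokens` list and the tiny `stop_words` set are hypothetical stand-ins so the example runs without downloading the corpus; in the notebook you would use `set(stopwords.words("english"))` instead.

```python
# Hypothetical sample: a few tokens in the style of the word_tokenize output above
tokens = ["Stop", "words", "are", "words", "that", "you", "want", "to", "ignore", ",",
          "so", "you", "filter", "them", "out", "."]

# Hypothetical stand-in for set(stopwords.words("english"))
stop_words = {"are", "that", "you", "to", "so", "them", "out"}

# Keep a token only if it is alphabetic and not a stop word (case-insensitive)
significant_words = [
    t for t in tokens
    if t.isalpha() and t.casefold() not in stop_words
]

print(significant_words)
```

The trade-off is that `isalpha()` also discards tokens that mix letters with other characters, such as numbers or hyphenated words, so whether it is appropriate depends on the analysis.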

## Stemming
> Reducing words to their root (which sometimes produces stems that are not real words)

In [4]:
# Porter Stemmer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemming_text = "Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words “helping” and “helper” share the root “help.” Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used. NLTK has more than one stemmer, but you’ll be using the Porter stemmer."
stemmer = PorterStemmer()

stemming_words = word_tokenize(stemming_text)

stemmed_text = [stemmer.stem(word) for word in stemming_words]

print(stemmed_text)

['stem', 'is', 'a', 'text', 'process', 'task', 'in', 'which', 'you', 'reduc', 'word', 'to', 'their', 'root', ',', 'which', 'is', 'the', 'core', 'part', 'of', 'a', 'word', '.', 'for', 'exampl', ',', 'the', 'word', '“', 'help', '”', 'and', '“', 'helper', '”', 'share', 'the', 'root', '“', 'help.', '”', 'stem', 'allow', 'you', 'to', 'zero', 'in', 'on', 'the', 'basic', 'mean', 'of', 'a', 'word', 'rather', 'than', 'all', 'the', 'detail', 'of', 'how', 'it', '’', 's', 'be', 'use', '.', 'nltk', 'ha', 'more', 'than', 'one', 'stemmer', ',', 'but', 'you', '’', 'll', 'be', 'use', 'the', 'porter', 'stemmer', '.']
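
The output above shows two typical stemming effects: related words are not always merged ("helping" becomes `help` but "helper" stays `helper`), and some stems are not real words ("has" becomes `ha`). A small sketch that makes this visible by grouping words under their Porter stem, assuming a hand-picked word list (`sample_words` is hypothetical, chosen from the text above):

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical word list picked from the sample text
sample_words = ["helping", "helper", "words", "word", "has"]

# Map each stem to the original words that reduce to it
groups = defaultdict(list)
for word in sample_words:
    groups[stemmer.stem(word)].append(word)

for stem, originals in groups.items():
    print(stem, "<-", originals)
```

Words that share a stem end up in the same group, so "words" and "word" are merged while "helping" and "helper" remain separate.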


## Lemmatization
> Reducing words to their `lemma` (base or dictionary form of a word)