## Text Analytics - Knowledge Graph, BERT, spaCy, NLTK - Notebook 04

This notebook covers language modeling and natural language processing (NLP) tools and methods. \
References: 
https://www.kaggle.com/code/pavansanagapati/knowledge-graph-nlp-tutorial-bert-spacy-nltk

In [1]:
import nltk
# !python -m nltk.downloader all 

from nltk.tokenize import sent_tokenize,word_tokenize

<b>Tokenization:</b>

In [2]:
example_text = "Text Analytics - Knowledge Graph, BERT, spaCy, NLTK - Notebook 04"
print(word_tokenize(example_text))

['Text', 'Analytics', '-', 'Knowledge', 'Graph', ',', 'BERT', ',', 'spaCy', ',', 'NLTK', '-', 'Notebook', '04']


<b>Stop Words:</b>

In [3]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
print(stop_words)

{'how', 'should', 'hers', "don't", 've', "hadn't", 'that', 'any', "won't", 'being', 'if', 'after', 'just', 'theirs', 'for', 'these', 'shouldn', 'up', 'once', 'but', 'too', 'their', 'such', "mustn't", 'him', 'did', 'to', 'mightn', 'will', 'those', 'so', 'hadn', 'most', 'between', 'needn', 'there', 'while', 'you', 'myself', 'having', 'nor', 'now', 'who', 'have', "wouldn't", 'll', 'wouldn', 'ours', 'himself', 'the', 'out', 'its', 'all', 'your', 'hasn', 'didn', 'before', 'when', 'has', 'doing', 'doesn', 'through', 'm', 'y', 'off', 'weren', 'do', 'against', 'ourselves', 'had', 'be', 'in', "weren't", "you'll", 'we', 'am', "you're", 'wasn', 'a', 'both', 'can', 'an', "that'll", 'why', "aren't", 'i', 'by', 'over', 'he', 'down', "should've", 'with', 'and', 's', "hasn't", 'isn', 'his', 'them', 'until', 'are', 'during', 'at', 'here', 'yourselves', 'ma', 'yourself', 'aren', 'were', "shouldn't", 'themselves', 'herself', 'she', 'then', 'shan', 'does', 'under', "wasn't", 'on', 'below', 'is', "you've",

In [4]:
example_text = "Ruchi is a good creature on this planet called Earth."

words = word_tokenize(example_text)

filtered_sentence = []
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
        
print(filtered_sentence) 

['Ruchi', 'good', 'creature', 'planet', 'called', 'Earth', '.']
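
The filter above compares each token to the stop-word set exactly as written, and NLTK's English stop words are all lowercase, so a capitalized token such as "The" would pass through. A minimal sketch of a case-insensitive variant follows; `STOP_WORDS` is a tiny hand-written set and `remove_stop_words` is a hypothetical helper, both used only for illustration, and `str.split` stands in for `word_tokenize`.

```python
# Hypothetical mini stop-word set for illustration; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "a", "on", "this"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form appears in the stop-word set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


# str.split is a stand-in for word_tokenize so the sketch needs no NLTK data.
tokens = "The planet is called Earth".split()

# Case-sensitive check, as in the cell above: the capitalized 'The' is kept.
case_sensitive = [tok for tok in tokens if tok not in STOP_WORDS]

# Case-insensitive check: 'The' is removed along with 'is'.
case_insensitive = remove_stop_words(tokens)

print(case_sensitive)
print(case_insensitive)
```

Lowercasing only for the comparison keeps the original casing of the tokens that survive, which matters for later steps such as POS tagging.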


<b>Stemming:</b>

In [5]:
from nltk.stem import PorterStemmer

example_text = "Ruchi is having coffee, but why is ruchi having coffee when she knows it is not good for health?"

sentences = sent_tokenize(example_text)
stemmer = PorterStemmer()
new_sentence = []

for sentence in sentences:
    words = word_tokenize(sentence)
    words = [stemmer.stem(word) for word in words]
    new_sentence.append(' '.join(words))
    
print(new_sentence)

['ruchi is have coffe , but whi is ruchi have coffe when she know it is not good for health ?']
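
The output above shows that stems are not always dictionary words ("coffe", "whi"). As a small sketch, the same stemmer can be applied word by word to make the mapping explicit; the word list is taken from the example sentence, and `stems` is just an illustrative variable name.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the example sentence above.
words = ["having", "coffee", "why", "knows"]

# Map each word to its Porter stem; the stemmer also lowercases its input.
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

The Porter stemmer applies suffix-stripping rules and does not consult a dictionary, which is why some results are truncated forms rather than real words.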


<b>Lemmatization:</b>

In [14]:
from nltk.stem import WordNetLemmatizer

sentences = sent_tokenize(example_text)
lemmatizer = WordNetLemmatizer()

lemmatized_sentences = []

for sentence in sentences:
    words = word_tokenize(sentence)
    # lemmatize() defaults to pos="n" (noun), so verb forms such as "having" are left unchanged
    words = [lemmatizer.lemmatize(word) for word in words]
    lemmatized_sentences.append(' '.join(words))
    
print(lemmatized_sentences)

['Ruchi is having coffee , but why is ruchi having coffee when she know it is not good for health ?']


<b>POS Tagging:</b>\
Let's use a different sentence tokenizer, the PunktSentenceTokenizer. It learns sentence-boundary cues through unsupervised training, so we can train it on any body of text before using it. Each sentence it returns is then split into words and tagged with `nltk.pos_tag`.

In [15]:
from nltk.tokenize import PunktSentenceTokenizer

# Now, let's create our training and testing data:
train_txt="Crocodiles (subfamily Crocodylinae) or true crocodiles are large aquatic reptiles that live throughout the tropics in Africa, Asia, the Americas and Australia. Crocodylinae, all of whose members are considered true crocodiles, is classified as a biological subfamily. A broader sense of the term crocodile, Crocodylidae that includes Tomistoma, is not used in this article. The term crocodile here applies to only the species within the subfamily of Crocodylinae. The term is sometimes used even more loosely to include all extant members of the order Crocodilia, which includes the alligators and caimans (family Alligatoridae), the gharial and false gharial (family Gavialidae), and all other living and fossil Crocodylomorpha."
sample_text ="Crocodiles are large aquatic reptiles which are carnivorous.Allegators belong to this same reptile species"

# Next, we can train the Punkt tokenizer like:
cust_tokenizer = PunktSentenceTokenizer(train_txt)

# Then we can actually tokenize, using:
tokenized = cust_tokenizer.tokenize(sample_text)

In [16]:
print("Speech Tagging Output")
def process_text():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))

process_text()

Speech Tagging Output
[('Crocodiles', 'NNS'), ('are', 'VBP'), ('large', 'JJ'), ('aquatic', 'JJ'), ('reptiles', 'NNS'), ('which', 'WDT'), ('are', 'VBP'), ('carnivorous.Allegators', 'NNS'), ('belong', 'RB'), ('to', 'TO'), ('this', 'DT'), ('same', 'JJ'), ('reptile', 'NN'), ('species', 'NNS')]
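
In the output above, the whole sample comes back as one sentence and `carnivorous.Allegators` is tagged as a single token, because the sample text has no space after the period. The sketch below, which assumes a shortened training text written for this illustration, compares the raw sample with a copy where the space (and the spelling "Alligators") is restored.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Shortened, hand-written training text (an assumption for this sketch).
train_text = (
    "Crocodiles are large aquatic reptiles that live throughout the tropics. "
    "The term crocodile here applies only to the species within the subfamily. "
    "The order Crocodilia also includes the alligators and caimans."
)
tokenizer = PunktSentenceTokenizer(train_text)

raw = (
    "Crocodiles are large aquatic reptiles which are carnivorous."
    "Allegators belong to this same reptile species"
)
# Restore the missing space and fix the spelling before tokenizing.
spaced = raw.replace("carnivorous.Allegators", "carnivorous. Alligators")

raw_sentences = tokenizer.tokenize(raw)
spaced_sentences = tokenizer.tokenize(spaced)

print(len(raw_sentences), raw_sentences)
print(len(spaced_sentences), spaced_sentences)
```

Training helps Punkt learn cues such as abbreviations, but it does not add a boundary where the period is not followed by whitespace, so cleaning the input is still needed.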


<b>Chunking:</b>\
Now that we know the parts of speech, we can do what is called chunking: grouping words into (hopefully) meaningful chunks. One of the main goals of chunking is to find "noun phrases," which are phrases of one or more words built around a noun, usually together with descriptive words such as adjectives. The idea is to group nouns with the words that relate to them. The grammar used below, `{<NNS.?>*<JJ>+}`, matches zero or more plural-noun tags followed by one or more adjective tags.

In [17]:
from nltk.tokenize import PunktSentenceTokenizer

# Now, let's create our training and testing data:
train_txt="Crocodiles (subfamily Crocodylinae) or true crocodiles are large aquatic reptiles that live throughout the tropics in Africa, Asia, the Americas and Australia. Crocodylinae, all of whose members are considered true crocodiles, is classified as a biological subfamily. A broader sense of the term crocodile, Crocodylidae that includes Tomistoma, is not used in this article. The term crocodile here applies to only the species within the subfamily of Crocodylinae. The term is sometimes used even more loosely to include all extant members of the order Crocodilia, which includes the alligators and caimans (family Alligatoridae), the gharial and false gharial (family Gavialidae), and all other living and fossil Crocodylomorpha."
sample_text ="Crocodiles are large aquatic reptiles which are carnivorous.Allegators belong to this same reptile species"

# Next, we can train the Punkt tokenizer like:
cust_tokenizer = PunktSentenceTokenizer(train_txt)

# Then we can actually tokenize, using:

tokenized = cust_tokenizer.tokenize(sample_text)
print("Chunked Output")
def process_text():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk:{<NNS.?>*<JJ>+}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            #chunked.draw()
            print(chunked)

    except Exception as e:
        print(str(e))

process_text()

Chunked Output
(S
  Crocodiles/NNS
  are/VBP
  (Chunk large/JJ aquatic/JJ)
  reptiles/NNS
  which/WDT
  are/VBP
  carnivorous.Allegators/NNS
  belong/RB
  to/TO
  this/DT
  (Chunk same/JJ)
  reptile/NN
  species/NNS)
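
In the output above, only the adjective runs ("large aquatic", "same") end up in chunks, because the grammar expects nouns before adjectives. As a hedged alternative, the sketch below uses a conventional noun-phrase pattern (optional determiner, adjectives, then nouns). The tokens are hand-tagged with the tags from the output above so that no tagger model is needed; the `NP` label and the variable names are choices made for this sketch.

```python
import nltk

# Hand-tagged tokens, copied from the POS output above.
tagged = [
    ("Crocodiles", "NNS"), ("are", "VBP"), ("large", "JJ"),
    ("aquatic", "JJ"), ("reptiles", "NNS"),
]

# Optional determiner, any number of adjectives, then one or more nouns.
np_parser = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")
tree = np_parser.parse(tagged)

# Collect the words of every NP subtree, in sentence order.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(tree)
print(noun_phrases)
```

With this pattern the adjectives are grouped with the noun they describe instead of forming a chunk of their own.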


<b>Chinking:</b>\
After a lot of chunking, you may still have words in a chunk that you do not want, with no obvious way to exclude them by chunking alone. Chinking may be the solution.

Chinking is a lot like chunking: it is a way to remove a sequence of tokens from a chunk. The part that you remove from the chunk is the chink. In an NLTK grammar, chunk patterns are written inside `{ }` and chink patterns inside `} {`.

In [18]:
from nltk.tokenize import PunktSentenceTokenizer

# Now, let's create our training and testing data:
train_txt="Crocodiles (subfamily Crocodylinae) or true crocodiles are large aquatic reptiles that live throughout the tropics in Africa, Asia, the Americas and Australia. Crocodylinae, all of whose members are considered true crocodiles, is classified as a biological subfamily. A broader sense of the term crocodile, Crocodylidae that includes Tomistoma, is not used in this article. The term crocodile here applies to only the species within the subfamily of Crocodylinae. The term is sometimes used even more loosely to include all extant members of the order Crocodilia, which includes the alligators and caimans (family Alligatoridae), the gharial and false gharial (family Gavialidae), and all other living and fossil Crocodylomorpha."
sample_text ="Crocodiles are large aquatic reptiles which are carnivorous.Allegators belong to this same reptile species"

# Next, we can train the Punkt tokenizer like:
cust_tokenizer = PunktSentenceTokenizer(train_txt)

# Then we can actually tokenize, using:

tokenized = cust_tokenizer.tokenize(sample_text)

print("Chinked Output")
def process_text():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            #chunked.draw()
            print(chunked)

    except Exception as e:
        print(str(e))

process_text()

Chinked Output
(S
  (Chunk Crocodiles/NNS)
  are/VBP
  (Chunk large/JJ aquatic/JJ reptiles/NNS which/WDT)
  are/VBP
  (Chunk carnivorous.Allegators/NNS belong/RB)
  to/TO
  this/DT
  (Chunk same/JJ reptile/NN species/NNS))
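
In the output above, `which/WDT` stays inside a chunk because the chink pattern lists `DT` but not `WDT`, and tag patterns match whole tags. A minimal sketch of extending the chink pattern follows; the tokens are hand-tagged with the tags from the output above, and adding `WDT` is an illustrative choice rather than part of the original grammar.

```python
import nltk

# Hand-tagged tokens, copied from the first part of the output above.
tagged = [
    ("Crocodiles", "NNS"), ("are", "VBP"), ("large", "JJ"), ("aquatic", "JJ"),
    ("reptiles", "NNS"), ("which", "WDT"), ("are", "VBP"),
]

# Chunk everything, then chink verbs, prepositions, determiners,
# wh-determiners (WDT, added for this sketch), and 'to'.
grammar = r"""
Chunk: {<.*>+}
       }<VB.?|IN|DT|WDT|TO>+{
"""
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the words of each remaining chunk, in sentence order.
chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(tree)
print(chunks)
```

Starting from "chunk everything" and carving out the unwanted tags is often easier than listing every tag sequence that should be kept.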


<b>Named Entity Recognition:</b>\
One of the most common forms of chunking in natural language processing is named entity recognition (NER). The idea is to have the machine pull out "entities" such as people, places, organizations, monetary figures, and more.

This can be a bit of a challenge, but NLTK has it built in. NLTK's `ne_chunk` offers two options: with `binary=True` it marks every named entity with a single `NE` label; with the default `binary=False` it labels each entity by type, such as person, organization, or location.

In [19]:
from nltk.tokenize import PunktSentenceTokenizer

# Now, let's create our training and testing data:
train_txt="Crocodiles (subfamily Crocodylinae) or true crocodiles are large aquatic reptiles that live throughout the tropics in Africa, Asia, the Americas and Australia. Crocodylinae, all of whose members are considered true crocodiles, is classified as a biological subfamily. A broader sense of the term crocodile, Crocodylidae that includes Tomistoma, is not used in this article. The term crocodile here applies to only the species within the subfamily of Crocodylinae. The term is sometimes used even more loosely to include all extant members of the order Crocodilia, which includes the alligators and caimans (family Alligatoridae), the gharial and false gharial (family Gavialidae), and all other living and fossil Crocodylomorpha."
sample_text ="Crocodiles are large aquatic reptiles which are carnivorous.Allegators belong to this same reptile species"

# Next, we can train the Punkt tokenizer like:
cust_tokenizer = PunktSentenceTokenizer(train_txt)

# Then we can actually tokenize, using:

tokenized = cust_tokenizer.tokenize(sample_text)

print("Named Entity Output")
def process_text():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged,binary = True)
            # namedEnt.draw()  # opens a Tk window; needs a display
            print(namedEnt)

    except Exception as e:
        print(str(e))

process_text()


Named Entity Output
(S
  Crocodiles/NNS
  are/VBP
  large/JJ
  aquatic/JJ
  reptiles/NNS
  which/WDT
  are/VBP
  carnivorous.Allegators/NNS
  belong/RB
  to/TO
  this/DT
  same/JJ
  reptile/NN
  species/NNS)
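
The tree above contains no `NE` subtrees, so no named entities were recognized in this sample. When entities are present, they appear as subtrees labeled `NE` (with `binary=True`), and they can be collected by walking the tree. The sketch below uses a hand-built `Tree` as a stand-in for `ne_chunk` output so that it runs without the chunker models; `extract_entities` and the example tree are hypothetical and exist only for illustration.

```python
from nltk.tree import Tree


def extract_entities(tree, label="NE"):
    """Return the text of every subtree with the given label, in sentence order."""
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]


# Hand-built stand-in for the shape of nltk.ne_chunk(tagged, binary=True) output;
# it is not the result of running the chunker.
example_tree = Tree("S", [
    ("Crocodiles", "NNS"), ("live", "VBP"), ("in", "IN"),
    Tree("NE", [("Africa", "NNP")]), ("and", "CC"),
    Tree("NE", [("Australia", "NNP")]),
])

entities = extract_entities(example_tree)
print(entities)
```

With `binary=False`, the same helper could be called with a type label such as `"PERSON"` or `"GPE"` instead of `"NE"`.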


<b>The Corpora</b>\
The NLTK corpora are a large collection of natural language data sets that are well worth a look.

Almost all of the NLTK corpora are accessed the same way through the NLTK module, but there is nothing magical about them. Most are plain text files, some are XML, and some use other formats; all of them can be opened manually or read through the module in Python. The example below reads the King James Bible from the Gutenberg corpus through the module.

In [20]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg
sample = gutenberg.raw("bible-kjv.txt")
tok = sent_tokenize(sample)
print(tok[5:15])

['1:5 And God called the light Day, and the darkness he called Night.', 'And the evening and the morning were the first day.', '1:6 And God said, Let there be a firmament in the midst of the waters,\nand let it divide the waters from the waters.', '1:7 And God made the firmament, and divided the waters which were\nunder the firmament from the waters which were above the firmament:\nand it was so.', '1:8 And God called the firmament Heaven.', 'And the evening and the\nmorning were the second day.', '1:9 And God said, Let the waters under the heaven be gathered together\nunto one place, and let the dry land appear: and it was so.', '1:10 And God called the dry land Earth; and the gathering together of\nthe waters called he Seas: and God saw that it was good.', '1:11 And God said, Let the earth bring forth grass, the herb yielding\nseed, and the fruit tree yielding fruit after his kind, whose seed is\nin itself, upon the earth: and it was so.', '1:12 And the earth brought forth grass, and
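
To find the corpus files for manual viewing, it can help to see where NLTK looks for its data. The sketch below prints `nltk.data.path`, the list of directories NLTK searches (corpora normally sit in a `corpora` subdirectory of one of them), and lists the Gutenberg file IDs if that corpus has been downloaded. `gutenberg_file_ids` is a hypothetical helper written for this sketch.

```python
import nltk
from nltk.corpus import gutenberg


def gutenberg_file_ids():
    """Return the Gutenberg corpus file IDs, or None if the corpus is not downloaded."""
    try:
        return gutenberg.fileids()
    except LookupError:
        # NLTK raises LookupError when a resource is missing from every search directory.
        return None


# Copy of the directories NLTK searches for downloaded data.
search_dirs = list(nltk.data.path)
file_ids = gutenberg_file_ids()

print(search_dirs)
print(file_ids)
```

If the helper returns `None`, the corpus can be fetched with `nltk.download("gutenberg")` before running the cell above.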

<b><i>End.</i></b>