## Text Analytics - Knowledge Graph, BERT, spaCy, NLTK - Notebook 02

This notebook covers several language modeling and natural language processing (NLP) tools and methods.

References: \
https://spacy.io \
https://www.kaggle.com/code/pavansanagapati/knowledge-graph-nlp-tutorial-bert-spacy-nltk

<b>spaCy | Industrial-Strength Natural Language Processing in Python</b>

<b>Key features:</b>
<li>80 trained pipelines for 24 languages</li>
<li>Multi-task learning with pretrained transformers like BERT</li>
<li>Pretrained word vectors</li>
<li>State-of-the-art speed</li>
<li>Production-ready training system</li>
<li>Linguistically-motivated tokenization</li>
<li>Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more</li>

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

Some of spaCy’s features work independently; others require statistical models to be loaded, which enable spaCy to predict linguistic annotations, for example whether a word is a verb or a noun. spaCy offers statistical models for a variety of languages, which can be installed as individual Python modules. Models differ in size, speed, memory usage, accuracy and the data they include, so the right choice depends on your use case and the texts you’re working with. For general-purpose use, the small default models are a good place to start. They typically include the following components:

<li><u>Binary weights</u> for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.</li>
<li><u>Lexical entries</u> in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.</li>
<li><u>Data files</u> like lemmatization rules and lookup tables.</li>
<li><u>Word vectors</u>, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.</li>
<li><u>Configuration options</u>, like the language and processing pipeline settings, to put spaCy in the correct state when you load in the model.</li>

These models are the engines of spaCy: they enable NLP tasks such as part-of-speech tagging, named entity recognition, and dependency parsing. Listed below are three English models with their specifications. The descriptions and sizes appear to come from an older spaCy release; the download logs later in this notebook report 12.8 MB for en_core_web_sm and 587.7 MB for en_core_web_lg at version 3.4.1:

>1. en_core_web_sm: English multi-task CNN trained on OntoNotes. Size – 11 MB
>2. en_core_web_md: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size – 91 MB
>3. en_core_web_lg: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size – 789 MB

Loading these models is straightforward. Once a model is installed, we can load it by calling spacy.load('model_name') as shown below:

In [1]:
# !pip3 install -U spacy

Collecting spacy
  Downloading spacy-3.5.1-cp38-cp38-macosx_10_9_x86_64.whl (6.8 MB)
Collecting spacy-legacy<3.1.0,>=3.0.11
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl (29 kB)
Collecting pathy>=0.10.0
  Downloading pathy-0.10.1-py3-none-any.whl (48 kB)
Collecting thinc<8.2.0,>=8.1.8
  Downloading thinc-8.1.9-cp38-cp38-macosx_10_9_x86_64.whl (848 kB)
Installing collected packages: spacy-legacy, pathy, thinc, spacy
  Attempting uninstall: spacy-legacy
    Found existing installation: spacy-legacy 3.0.10
    Uninstalling spacy-legacy-3.0.10:
      Successfully uninstalled spacy-legacy-3.0.1

In [3]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.1/en_core_web_lg-3.4.1-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
  Attempting uninstall: en-core-web-lg
    Found existing installation: en-core-web-lg 3.4.0
    Uninstalling en-core-web-lg-3.4.0:
      Successfully uninstalled en-core-web-lg-3.4.0
Successfully installed en-core-web-lg-3.4.1

[notice] A new release of pip available: 22.3 -> 23.0.1
[notice] To update, run: pip3.7 install --upgrade pip
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [4]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.4.1

[notice] A new release of pip available: 22.3 -> 23.0.1
[notice] To update, run: pip3.7 install --upgrade pip
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')



In [2]:
doc = nlp("Ruchi has had two cups of coffee today.")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Ruchi PROPN nsubj
has AUX aux
had VERB ROOT
two NUM nummod
cups NOUN dobj
of ADP prep
coffee NOUN pobj
today NOUN npadvmod
. PUNCT punct


<b>Understanding spaCy's processing pipeline:</b>

In [3]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [4]:
# disable_pipes is the legacy spelling; spaCy 3 also offers nlp.select_pipes(disable=[...])
nlp.disable_pipes('tagger', 'parser')

['tagger', 'parser']

In [5]:
nlp.pipe_names

['tok2vec', 'attribute_ruler', 'lemmatizer', 'ner']

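<b>Tokenization:</b>
Segmenting text into words, punctuation marks, etc. Because the tagger and parser were disabled above, token.pos_ and token.dep_ are empty strings in the output below, so only the token text is printed.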

In [6]:
doc = nlp("Ruchi has had two cups of coffee today.")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Ruchi  
has  
had  
two  
cups  
of  
coffee  
today  
.  
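To make the idea of tokenization concrete, here is a minimal sketch that uses a regular expression instead of spaCy. This is a simplified stand-in, not spaCy's tokenizer: spaCy applies language-specific rules and exception lists, while this pattern only separates runs of word characters from individual punctuation marks. The function name `simple_tokenize` is made up for illustration.

```python
import re

def simple_tokenize(text):
    # One token per run of word characters, or per single
    # non-space, non-word character (punctuation).
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Ruchi has had two cups of coffee today.")
print(tokens)
```

For this sentence the sketch produces the same nine tokens as the spaCy output above. It breaks down on contractions and possessives: "Ruchi's" becomes three pieces (Ruchi, ', s), which is one reason to prefer a linguistically motivated tokenizer.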




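<b>POS Tagging:</b> Parts of Speech. The first cell below still runs with the tagger disabled, so token.tag_ and token.pos_ are empty and spacy.explain returns None. Reloading the model in the following cell restores the full pipeline and the tags.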

In [7]:
for token in doc:
    print(token.tag_, token.pos_, spacy.explain(token.tag_))

  None
  None
  None
  None
  None
  None
  None
  None
  None




In [8]:
nlp = spacy.load('en_core_web_sm')
doc = nlp("Ruchi has had two cups of coffee today.")
for token in doc:
    print(token.tag_, token.pos_, spacy.explain(token.tag_))

NNP PROPN noun, proper singular
VBZ AUX verb, 3rd person singular present
VBN VERB verb, past participle
CD NUM cardinal number
NNS NOUN noun, plural
IN ADP conjunction, subordinating or preposition
NN NOUN noun, singular or mass
NN NOUN noun, singular or mass
. PUNCT punctuation mark, sentence closer


<b>Using built-in displaCy visualizer</b>:

In [9]:
from spacy import displacy

doc = nlp("Ruchi had two cups of coffee today.")
displacy.render(doc, style="dep" , jupyter=True)

<b>Dependency Parsing:</b> process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationship between headwords and their dependents. The head of a sentence has no dependency and is called the root of the sentence. The verb is usually the head of the sentence. All other words are linked to the headword.

In [10]:
doc = nlp("Ruchi had two cups of coffee today.")
for token in doc:
    # Print the token and its dependency label
    print(token.text, "-->", token.dep_)

Ruchi --> nsubj
had --> ROOT
two --> nummod
cups --> dobj
of --> prep
coffee --> pobj
today --> npadvmod
. --> punct
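A dependency parse is a tree in which every token points to its head. The sketch below stores the parse of the sentence above as plain lists and walks it without spaCy. The tokens and labels are copied from the output; the head indices are an assumption (they are not printed in this notebook) chosen to be consistent with those labels. The helper names `children_of` and `path_to_root` are made up for illustration.

```python
# Tokens and dependency labels copied from the output above.
tokens = ["Ruchi", "had", "two", "cups", "of", "coffee", "today", "."]
labels = ["nsubj", "ROOT", "nummod", "dobj", "prep", "pobj", "npadvmod", "punct"]

# ASSUMPTION: index of each token's head (not shown in the notebook).
# The root points to itself, the same convention spaCy uses for token.head.
heads = [1, 1, 3, 1, 3, 4, 1, 1]

def children_of(index):
    # Direct dependents of the token at `index`.
    return [tokens[i] for i, h in enumerate(heads) if h == index and i != index]

def path_to_root(index):
    # Follow head links upward until the root is reached.
    path = [tokens[index]]
    while heads[index] != index:
        index = heads[index]
        path.append(tokens[index])
    return path

root_index = labels.index("ROOT")
print("root:", tokens[root_index])
print("children of root:", children_of(root_index))
print("coffee to root:", path_to_root(tokens.index("coffee")))
```

With a real Doc, the same information is available through token.head and token.children.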


The dependency tag ROOT denotes the main verb or action in the sentence. The other words are directly or indirectly connected to the ROOT word of the sentence. You can find out what other tags stand for by executing the code below:

In [11]:
(spacy.explain("nsubj"), 
spacy.explain("ROOT"), 
spacy.explain("aux"), 
spacy.explain("advcl"), 
spacy.explain("dobj"))

('nominal subject',
 'root',
 'auxiliary',
 'adverbial clause modifier',
 'direct object')

<b>Lemmatization</b>: process of reducing inflected forms of a word while still ensuring that the reduced form belongs to the language. This reduced form or root word is called a lemma.

In [12]:
for token in doc:
    # Print the token and its lemma
    print(token.text, "-->", token.lemma_)

Ruchi --> Ruchi
had --> have
two --> two
cups --> cup
of --> of
coffee --> coffee
today --> today
. --> .
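As a rough illustration of how a lemmatizer can combine a lookup table with rules, here is a toy version in plain Python. It is not spaCy's lemmatizer: the exception table `IRREGULAR` and the plural rule are hypothetical and cover only a handful of cases.

```python
# Hypothetical exception table for irregular forms.
IRREGULAR = {"had": "have", "has": "have", "was": "be", "were": "be"}

def toy_lemma(word):
    lower = word.lower()
    # 1. Irregular forms come from the lookup table.
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    # 2. Very rough plural rule: drop a trailing "s" from longer words.
    if len(lower) > 3 and lower.endswith("s") and not lower.endswith("ss"):
        return word[:-1]
    # 3. Everything else is returned unchanged.
    return word

words = ["Ruchi", "had", "two", "cups", "of", "coffee", "today", "."]
lemmas = [toy_lemma(w) for w in words]
for word, lemma in zip(words, lemmas):
    print(word, "-->", lemma)
```

On this sentence the toy rules reproduce the lemmas printed above, but they would mangle words such as "always", which is why real lemmatizers also rely on part-of-speech information.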


<b>Sentence Boundary Detection (SBD):</b> process of locating the start and end of sentences in a given text. This allows you to divide a text into linguistically meaningful units. You’ll use these units when you’re processing your text to perform tasks such as part-of-speech tagging and entity extraction.

In [13]:
doc = nlp("Ruchi like coffee. But coffee adds to Ruchi's anxiety. Still Ruchi drinks a lot of coffee.")

sentences = list(doc.sents)
len(sentences)

3

In [14]:
for sentence in sentences:
     print (sentence)

Ruchi like coffee.
But coffee adds to Ruchi's anxiety.
Still Ruchi drinks a lot of coffee.
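For comparison, a naive sentence splitter can be written with a regular expression. This is a simplified stand-in and not how doc.sents is computed here, where the boundaries come from the loaded pipeline. The function name `split_sentences` is made up for illustration.

```python
import re

def split_sentences(text):
    # Split on whitespace that follows ".", "!" or "?".
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = ("Ruchi like coffee. But coffee adds to Ruchi's anxiety. "
        "Still Ruchi drinks a lot of coffee.")
sentences = split_sentences(text)
print(len(sentences))
for sentence in sentences:
    print(sentence)
```

The rule works for this text but treats every period as a boundary, so an abbreviation such as "Dr." would be split off as its own sentence.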


<b>Named Entity Recognition</b>: A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

In [15]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Ruchi 0 5 PERSON
Ruchi 38 43 ORG
Ruchi 61 66 PERSON
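The start_char and end_char values are character offsets into the original string, so each entity can be recovered by slicing. The short check below uses the text and the three (start, end, label) triples printed above; no spaCy is needed.

```python
text = ("Ruchi like coffee. But coffee adds to Ruchi's anxiety. "
        "Still Ruchi drinks a lot of coffee.")

# (start_char, end_char, label) copied from the output above.
spans = [(0, 5, "PERSON"), (38, 43, "ORG"), (61, 66, "PERSON")]

surface_forms = [text[start:end] for start, end, _ in spans]
labels_for_ruchi = {label for _, _, label in spans}

for (start, end, label), surface in zip(spans, surface_forms):
    print(surface, start, end, label)
print(labels_for_ruchi)
```

All three spans cover the same name, yet the second occurrence was labelled ORG. This is an example of the imperfect predictions mentioned above: the label depends on the surrounding context, not only on the word itself.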


<b>Entity Detection:</b> Also called entity recognition, this identifies important elements such as places, people, organizations, and languages within an input text. It is helpful for quickly extracting information, since you can pick out important topics or identify key sections of text.

In [16]:
doc = nlp(u"""The Ruchi is a mountain in the Glarus Alps, located at an elevation of 3,107 m (10,194 ft) 
            on the border between the Swiss cantons of Glarus and Graubünden. It overlooks the Muttsee 
            (2,446 m or 8,025 ft) on its west side from where a trail leads to the summit. 
            On its south-east side lies a small glacier, the Glatscher da Gavirolas. The Ruchi is 
            connected to the higher summit of the Hausstock on the north-east by a 2 km (1.2 mi) long 
            ridge.[2]

            The nearest settlements are the villages of Linthal to the north, and Andiast to the south. 
            Administratively, the mountain lies in the municipalities of Glarus Süd and Waltensburg/Vuorz.[2]
    
            The Glarus Alps (German: Glarner Alpen) are a mountain range in central Switzerland. They 
            are bordered by the Uri Alps and the Schwyz Alps to the west, the Lepontine Alps to the 
            south, the Appenzell Alps to the northeast. The eastern part of the Glarus Alps contains a 
            major thrust fault that was declared a geologic UNESCO World Heritage Site (the Swiss 
            Tectonic Arena Sardona).

            The Glarus Alps extend well beyond the canton of Glarus, including parts of the cantons of 
            Uri, Graubünden, and St Gallen. Conversely, not all the mountains in the canton of Glarus 
            are part of the Glarus Alps, with those to the north of the Urner Boden and to the west of 
            the valley of the river Linth considered to be part of the Schwyz Alps.""")

In [17]:
entities=[(i, i.label_, i.label) for i in doc.ents]
entities

[(Ruchi, 'PERSON', 380),
 (Glarus Alps, 'FAC', 9191306739292312949),
 (3,107, 'CARDINAL', 397),
 (10,194 ft, 'QUANTITY', 395),
 (Swiss, 'NORP', 381),
 (Glarus, 'GPE', 384),
 (Muttsee, 'FAC', 9191306739292312949),
 (2,446, 'CARDINAL', 397),
 (8,025, 'CARDINAL', 397),
 (Glatscher, 'PERSON', 380),
 (Ruchi, 'PERSON', 380),
 (2 km, 'QUANTITY', 395),
 (1.2, 'CARDINAL', 397),
 (Linthal, 'PERSON', 380),
 (Andiast, 'ORG', 383),
 (Glarus Süd, 'ORG', 383),
 (Waltensburg/Vuorz.[2, 'ORG', 383),
 (German, 'NORP', 381),
 (Switzerland, 'GPE', 384),
 (Schwyz, 'GPE', 384),
 (Appenzell, 'ORG', 383),
 (UNESCO World Heritage Site, 'ORG', 383),
 (the Swiss 
              Tectonic Arena Sardona,
  'ORG',
  383),
 (Glarus, 'GPE', 384),
 (St Gallen, 'PERSON', 380),
 (Glarus, 'GPE', 384),
 (the Urner Boden, 'FAC', 9191306739292312949),
 (Linth, 'ORG', 383),
 (Schwyz, 'GPE', 384),
 (Alps, 'GPE', 384)]

In [18]:
displacy.render(doc, style = "ent",jupyter = True)

In [19]:
# this is actually incorrect
# it didn't catch that Ruchi here is a mountain and not a person

<b>Similarity:</b> Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word.
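By default, spaCy's similarity score is the cosine similarity between two vectors. The sketch below computes it in plain Python on small, made-up 3-dimensional vectors (real word vectors have hundreds of dimensions); the vector values and variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not taken from any spaCy model.
cat = [3.0, 4.0, 0.0]
kitten = [6.0, 8.0, 0.0]   # same direction as cat
dog = [4.0, 3.0, 0.0]      # close to cat, but not identical in direction
car = [0.0, 0.0, 1.0]      # orthogonal to cat

print(cosine_similarity(cat, kitten))
print(cosine_similarity(cat, dog))
print(cosine_similarity(cat, car))
```

Only the direction of the vectors matters, not their length, which is why a vector and a scaled copy of it have similarity 1.0.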

In [20]:
tokens = nlp("Okay Hello Bye pspsps rsrsrsrs Ruchi Cat Time Series Analysis lololol")
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

Okay True 9.226926 True
Hello True 8.065675 True
Bye True 8.693295 True
pspsps True 8.254264 True
rsrsrsrs True 7.480596 True
Ruchi True 9.379858 True
Cat True 8.8930645 True
Time True 9.393447 True
Series True 7.8311586 True
Analysis True 8.04443 True
lololol True 7.4303126 True


In [21]:
tokens = nlp("Animal Cat")
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

Animal Animal 1.0
Animal Cat 0.2703475058078766
Cat Animal 0.2703475058078766
Cat Cat 1.0




In [22]:
tokens = nlp("Dog Cat")
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

Dog Dog 1.0
Dog Cat 0.4548181891441345
Cat Dog 0.4548181891441345
Cat Cat 1.0




In [23]:
nlp = spacy.load("en_core_web_lg")
tokens = nlp("Animal Cat")
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))



Animal Animal 1.0
Animal Cat 0.4878849685192108
Cat Animal 0.4878849685192108
Cat Cat 1.0


In [24]:
tokens = nlp("Dog Cat")
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

Dog Dog 1.0
Dog Cat 0.7704364657402039
Cat Dog 0.7704364657402039
Cat Cat 1.0


In [25]:
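# The scores above differ because of the vectors, not just the vocabulary size:
# en_core_web_sm ships without a static word-vector table (every token is reported
# as out-of-vocabulary above), while en_core_web_lg includes a large vector table.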

<b>Text Classification:</b> Assigning categories or labels to a whole document, or parts of a document.

In [26]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

df_amazon = pd.read_csv ("amazon_alexa.tsv", sep="\t")
df_amazon.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1


In [27]:
df_amazon.shape

(3150, 5)

In [28]:
df_amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3150 non-null   object
 4   feedback          3150 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 123.2+ KB


In [29]:
df_amazon.feedback.value_counts()

1    2893
0     257
Name: feedback, dtype: int64
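The counts above show a strongly imbalanced dataset, which matters when reading accuracy scores later on. As a quick sanity check, the snippet below computes the accuracy of a trivial baseline that always predicts the majority class, using only the two counts printed above.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 2893, 0: 257}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 4))
```

Any classifier trained on this data should be compared against this baseline of roughly 0.92, not against 0.5.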

<b>Creating a custom tokenizer function using spaCy</b> to automatically strip information that is not required (e.g., stop words and punctuation).

In [117]:
import string
from spacy.lang.en.stop_words import STOP_WORDS

# Punctuation characters to drop (a single string, so "in" is a substring test)
punctuations = string.punctuation

# Load the pipeline once; its lemmatizer is used inside the tokenizer
nlp = spacy.load("en_core_web_sm")
stop_words = STOP_WORDS

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Run the spaCy pipeline to get a Doc with linguistic annotations
    doc = nlp(sentence)

    # Lemmatize and lowercase each token. spaCy 3 no longer emits the
    # "-PRON-" placeholder lemma, so pronouns need no special case.
    lemmas = [token.lemma_.lower().strip() for token in doc]

    # Remove stop words, punctuation, and empty strings
    return [lemma for lemma in lemmas
            if lemma and lemma not in stop_words and lemma not in punctuations]

In [71]:
!python -m spacy validate

✔ Loaded compatibility table

ℹ spaCy installation:
/opt/anaconda3/lib/python3.7/site-packages/spacy

NAME             SPACY            VERSION
en_core_web_sm   >=3.4.0,<3.5.0   3.4.1   ✔
en_core_web_md   >=3.4.0,<3.5.0   3.4.0   ✔
en_core_web_lg   >=3.4.0,<3.5.0   3.4.1   ✔



To further clean our text data, we’ll also create a <b>custom transformer that removes leading and trailing spaces and converts text to lower case</b>. Here, we create a custom predictors class that inherits from scikit-learn’s TransformerMixin class. The class defines the transform, fit and get_params methods. We’ll also create a clean_text() function that strips surrounding whitespace and converts text to lowercase.

In [118]:
# Custom transformer using spaCy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

Representing the text numerically: <b>Bag of Words</b> \
 BoW converts text into a matrix that records how often each word occurs in each document, ignoring word order. This matrix is usually called a BoW matrix or a document-term matrix.\
We can generate a BoW matrix for our text data with scikit-learn’s CountVectorizer. In the code below, we tell CountVectorizer to use the custom spacy_tokenizer function we built as its tokenizer and set the n-gram range. N-grams are sequences of n adjacent words in a text; ngram_range=(1,1) keeps single words only.
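To show what a document-term matrix and n-grams look like, here is a small sketch in plain Python. It is an illustration of the idea, not of CountVectorizer's internals; the three example documents and the helper name `ngrams` are made up, and whitespace splitting stands in for the spaCy tokenizer.

```python
from collections import Counter

# Hypothetical mini-corpus, tokenized by whitespace for simplicity.
docs = ["love my echo", "love it love it", "music"]
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct token, in sorted order (one column per term).
vocabulary = sorted({token for doc in tokenized for token in doc})

# Document-term matrix: one row per document, one count per vocabulary term.
matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

def ngrams(tokens, n):
    # All sequences of n adjacent tokens, joined with a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(vocabulary)
for row in matrix:
    print(row)
print(ngrams(tokenized[0], 2))
```

Each row keeps the counts but loses word order; adding bigrams (n=2) to the vocabulary is one way to recover some local order.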

In [119]:
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

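<b>TF-IDF</b> (Term Frequency-Inverse Document Frequency): a way of reweighting Bag of Words (BoW) counts so that a word’s frequency within a document is scaled down by the number of documents that contain it. Words that appear in most documents get low weights; words concentrated in a few documents get high weights.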
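The weighting can be sketched in a few lines of plain Python. This follows the smoothed formula that scikit-learn's TfidfVectorizer uses with its default settings, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document. The tiny corpus and the function names `idf` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized corpus.
docs = [["love", "echo"], ["love", "music"], ["music", "music", "speaker"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Raw term counts times idf, then scale the row to unit length.
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf(docs[0])
print(first)
```

In the first document, "echo" appears in only one document of the corpus while "love" appears in two, so "echo" receives the larger weight even though both occur once.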

In [120]:
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)

In [121]:
from sklearn.model_selection import train_test_split

X = df_amazon['verified_reviews'] # the features we want to analyze
ylabels = df_amazon['feedback'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

Now that we’re all set up, it’s time to build our model. We’ll start by importing the LogisticRegression class and creating a classifier object.

Then, we’ll create a pipeline with three components: a cleaner, a vectorizer, and a classifier. The cleaner uses our predictors class to strip and lowercase the text. The vectorizer uses the CountVectorizer object (bow_vector) to create the bag-of-words matrix for our text. The classifier is a logistic regression model that predicts the feedback label.

Once this pipeline is built, we’ll fit the pipeline components using fit().

In [124]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

# create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train, y_train)

In [125]:
from sklearn import metrics

# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

Logistic Regression Accuracy: 0.944973544973545
Logistic Regression Precision: 0.9496166484118291
Logistic Regression Recall: 0.993127147766323
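To interpret these numbers, it helps to see how accuracy, precision and recall come from the counts of true and false positives and negatives. The sketch below computes them by hand on made-up labels in which a model predicts the positive class for every example; the data and the helper name `binary_metrics` are hypothetical and are not this notebook's results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical imbalanced labels: 8 positives, 2 negatives.
y_true = [1] * 8 + [0] * 2
# A "model" that always predicts the positive class.
y_pred = [1] * 10

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Even this trivial predictor reaches perfect recall and high accuracy on imbalanced data. Since about 92% of the reviews here carry feedback = 1, the scores above say little about how well the negative class is detected; metrics computed with the negative class as the positive label would be more telling.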


Other NLP tools continued in next notebook.