# spaCy and NLTK

NLP Operations in Python
Two packages:
 1. NLTK (Natural Language Toolkit)
 2. spaCy

To install spaCy on a local machine, refer to the link below:

https://spacy.io/usage

In [1]:
!pip install spacy
!pip install nltk



In [2]:
#Download model (English Core Model - Small Version)
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 4.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# Text Preprocessing Steps:
1. Data Acquisition
2. Data Cleaning
3. Data Normalization
4. Tokenization
5. Stopword Removal
6. POS Tagging
7. Named Entity Recognition (NER)
8. Stemming and Lemmatization
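
As a rough illustration of steps 2, 3 and 5, here is a minimal sketch that uses only the Python standard library. The stop list, the helper names (`clean`, `normalize`, `remove_stopwords`) and the HTML-wrapped sample string are assumptions made for this example; real projects would normally rely on a library-provided stop list.

```python
import re
import unicodedata

# Tiny illustrative stop list (an assumption for this sketch only).
STOPWORDS = {"a", "an", "the", "is", "at", "for", "to", "and", "in", "of"}

def clean(text):
    """Data cleaning: drop HTML-like tags and collapse repeated whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def normalize(text):
    """Data normalization: Unicode NFKC form plus lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()

def remove_stopwords(tokens):
    """Stopword removal: keep only tokens that are not in the stop list."""
    return [t for t in tokens if t not in STOPWORDS]

raw = "<p>Tesla  is looking at buying U.S. startups for $6 million</p>"
tokens = normalize(clean(raw)).split()  # naive whitespace tokenization
filtered = remove_stopwords(tokens)
print(filtered)
```

The whitespace split is only a placeholder for the tokenization step; it keeps `$6` as a single token.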

In [3]:
import spacy
import nltk

In [4]:
#Check if Spacy is working properly
import spacy
model = spacy.load('en_core_web_sm')

## POS Tagging

In [5]:
# POS (part-of-speech) tagging is the NLP step that assigns each word in a text its part of speech.
#
# Common Penn Treebank tags (the tag set used by NLTK's default English tagger):
# NN - Noun
# VB - Verb
# JJ - Adjective
# RB - Adverb
# PRP - Pronoun
# IN - Preposition
# CC - Coordinating Conjunction
# DT - Determiner

# Why do we need POS tagging?
# POS tagging helps to understand the grammatical structure of a sentence and can be used in
#    a. Machine Translation
#    b. Text Analytics
#    c. Information Retrieval
#    c. Information Retrieval

# Different POS Tags and Meaning Reference: 
# https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk

In [6]:
textExample = u"Tesla is looking at buying U.S. startups for $6 million"
# POS tagging demo
#
# Note: spaCy expects Unicode text. In Python 3 every str is already Unicode, so the u prefix is optional.

data = model(textExample)

for record in data:
  print(record.text , record.pos_)

Tesla PROPN
is AUX
looking VERB
at ADP
buying VERB
U.S. PROPN
startups NOUN
for ADP
$ SYM
6 NUM
million NUM
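
One way such tags can feed information retrieval is keyword filtering. The sketch below works on the (text, tag) pairs printed above, copied in by hand so it runs without spaCy. Which tags count as "content" words is an assumption made for this example.

```python
# (text, coarse POS) pairs copied from the spaCy output above.
tagged = [
    ("Tesla", "PROPN"), ("is", "AUX"), ("looking", "VERB"), ("at", "ADP"),
    ("buying", "VERB"), ("U.S.", "PROPN"), ("startups", "NOUN"),
    ("for", "ADP"), ("$", "SYM"), ("6", "NUM"), ("million", "NUM"),
]

# Assumption: nouns, proper nouns and main verbs are treated as content words.
CONTENT_TAGS = {"PROPN", "NOUN", "VERB"}

keywords = [text for text, pos in tagged if pos in CONTENT_TAGS]
print(keywords)  # ['Tesla', 'looking', 'buying', 'U.S.', 'startups']
```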


In [7]:
#POS Tagging using NLTK
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/oysterable/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/oysterable/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [8]:
textExample = u"Tesla is looking at buying U.S. startups for $6 million"
words = nltk.word_tokenize(textExample)
print("Extracted Words are: \n\n",words)
posTags = nltk.pos_tag(words)
print(posTags)

Extracted Words are: 

 ['Tesla', 'is', 'looking', 'at', 'buying', 'U.S.', 'startups', 'for', '$', '6', 'million']
[('Tesla', 'NNP'), ('is', 'VBZ'), ('looking', 'VBG'), ('at', 'IN'), ('buying', 'VBG'), ('U.S.', 'NNP'), ('startups', 'NN'), ('for', 'IN'), ('$', '$'), ('6', 'CD'), ('million', 'CD')]
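
NLTK returns fine-grained Penn Treebank tags (NNP, VBZ, ...), whereas the spaCy `pos_` values shown earlier are coarse categories (PROPN, VERB, ...). The sketch below collapses the NLTK tags into coarse labels so the two outputs can be compared side by side. The prefix table is a hypothetical simplification written for this example, not an official mapping.

```python
# (word, Penn Treebank tag) pairs copied from the NLTK output above.
penn_tagged = [
    ("Tesla", "NNP"), ("is", "VBZ"), ("looking", "VBG"), ("at", "IN"),
    ("buying", "VBG"), ("U.S.", "NNP"), ("startups", "NN"), ("for", "IN"),
    ("$", "$"), ("6", "CD"), ("million", "CD"),
]

# Hypothetical prefix table; entries are checked in order, first match wins,
# so "NNP" must come before the more general "NN".
PREFIX_TO_COARSE = [
    ("NNP", "PROPN"), ("NN", "NOUN"), ("VB", "VERB"), ("JJ", "ADJ"),
    ("RB", "ADV"), ("PRP", "PRON"), ("IN", "ADP"), ("CC", "CCONJ"),
    ("DT", "DET"), ("CD", "NUM"), ("$", "SYM"),
]

def coarse_tag(penn_tag):
    """Map a Penn Treebank tag to a coarse label by prefix."""
    for prefix, coarse in PREFIX_TO_COARSE:
        if penn_tag.startswith(prefix):
            return coarse
    return "X"  # fallback for tags the table does not cover

coarse = [(word, coarse_tag(tag)) for word, tag in penn_tagged]
print(coarse)
```

With this table, the only token whose coarse label differs from the spaCy output is "is" (VERB here, AUX in spaCy), because the Penn tag VBZ does not distinguish auxiliaries from main verbs.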


## Tokenization
Splitting text into smaller units (tokens) such as words or sentences

a. Word tokenization (Used extensively in Pattern/Logic Extraction) (e.g. ML/DL model for classification or clustering)

b. Sentence Tokenization (Used for machine/language translation)
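
Before word tokens can be fed to an ML/DL model they are usually mapped to integer ids. The following is a minimal sketch of that mapping. The `vocab` layout, the `<unk>` entry and the `encode` helper are assumptions made for this example.

```python
tokens = ["Tesla", "is", "looking", "at", "buying", "U.S.", "startups",
          "for", "$", "6", "million"]

# Build a vocabulary: each distinct lowercased token gets the next free id.
# Id 0 is reserved for unknown tokens (a convention assumed for this sketch).
vocab = {"<unk>": 0}
for tok in tokens:
    vocab.setdefault(tok.lower(), len(vocab))

def encode(words):
    """Replace each word with its vocabulary id, or 0 if it is unknown."""
    return [vocab.get(w.lower(), 0) for w in words]

print(encode(["Tesla", "is", "buying", "Apple"]))  # [1, 2, 5, 0]
```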

In [9]:
# Word tokenization (naive approach: split on spaces)

textExample.split(" ")

['Tesla',
 'is',
 'looking',
 'at',
 'buying',
 'U.S.',
 'startups',
 'for',
 '$6',
 'million']

In [10]:
data = model(textExample)
print("Word Tokens for the given sentence are: \n\n")
for record in data:
  print(record.text )

Word Tokens for the given sentence are: 


Tesla
is
looking
at
buying
U.S.
startups
for
$
6
million
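
The difference between the two results is that `split(" ")` keeps `$6` together while spaCy separates the currency symbol from the number. The sketch below imitates that behaviour with a small regular expression. It is only an approximation written for this example and is not how spaCy's tokenizer is implemented.

```python
import re

TOKEN_PATTERN = re.compile(
    r"(?:[A-Za-z]\.){2,}"   # dotted abbreviations such as U.S.
    r"|\w+"                 # runs of letters, digits or underscores
    r"|[^\w\s]"             # any single punctuation or symbol character
)

def regex_tokenize(text):
    """Return all non-overlapping matches of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)

text = "Tesla is looking at buying U.S. startups for $6 million"
print(text.split(" "))       # '$6' stays one token
print(regex_tokenize(text))  # '$' and '6' become separate tokens
```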


In [11]:
#Sentence Tokenization

textData = u"Welcome to Simplilearn. I am Prashant Nair and I will be your Instructor. Lets learn NLP"

data = model(textData)

[sent for sent in data.sents]

[Welcome to Simplilearn.,
 I am Prashant Nair and I will be your Instructor.,
 Lets learn NLP]

In [12]:
#Download the necessary tokenizer models
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/oysterable/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
#For Word Tokenization
from nltk.tokenize import word_tokenize
textData = u"Welcome to Simplilearn. I am Prashant Nair and I will be your Instructor. Lets learn NLP"
words = word_tokenize(textData)
words

['Welcome',
 'to',
 'Simplilearn',
 '.',
 'I',
 'am',
 'Prashant',
 'Nair',
 'and',
 'I',
 'will',
 'be',
 'your',
 'Instructor',
 '.',
 'Lets',
 'learn',
 'NLP']

In [14]:
#For Sentence Tokenization

from nltk.tokenize import sent_tokenize
textData = u"Welcome to Simplilearn. I am Prashant Nair and I will be your Instructor. Lets learn NLP"
sentences = sent_tokenize(textData)
sentences

['Welcome to Simplilearn.',
 'I am Prashant Nair and I will be your Instructor.',
 'Lets learn NLP']
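
To see why a trained sentence tokenizer is helpful, compare it with a naive rule that splits after every `.`, `!` or `?`. The helper `naive_sent_split` below is a hypothetical function written for this sketch. It handles the example above but breaks on abbreviations such as `U.S.`.

```python
import re

def naive_sent_split(text):
    """Split wherever whitespace follows '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple = "Welcome to Simplilearn. I am Prashant Nair and I will be your Instructor. Lets learn NLP"
tricky = "Tesla is looking at buying U.S. startups for $6 million"

print(naive_sent_split(simple))  # three sentences, as expected
print(naive_sent_split(tricky))  # wrongly split after 'U.S.'
```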

## NER (Named Entity Recognition)

In [15]:
# NER is used in NLP applications to identify entities in text and classify them into predefined categories
# such as Person, Organization, etc.
#
# Scenarios where NER is used extensively:
# 1. Information Extraction: extract key information such as company names, financial figures, dates, etc.
# 2. Search Engines: improve search relevance by recognizing the entities mentioned in articles and queries.
# 3. QA systems: provide accurate answers to questions.
# etc.

#NER Category Reference: https://learn.microsoft.com/en-us/azure/ai-services/language-service/named-entity-recognition/concepts/named-entity-categories?tabs=ga-api

In [16]:
#NER using Spacy

sentence = u"Apple to build a factory in Hong Kong and Mumbai with an initial investment of $100 million in collaboration with Microsoft and Simplilearn. Lets try 123-45-6789"

data = model(sentence)

for entry in data.ents:
  print(f"Token is : {entry.text}, NER detected:  {entry.label_}, Explain : {spacy.explain(entry.label_)}")

Token is : Apple, NER detected:  ORG, Explain : Companies, agencies, institutions, etc.
Token is : Hong Kong, NER detected:  GPE, Explain : Countries, cities, states
Token is : Mumbai, NER detected:  GPE, Explain : Countries, cities, states
Token is : $100 million, NER detected:  MONEY, Explain : Monetary values, including unit
Token is : Microsoft, NER detected:  ORG, Explain : Companies, agencies, institutions, etc.
Token is : Simplilearn, NER detected:  ORG, Explain : Companies, agencies, institutions, etc.
Token is : 123, NER detected:  CARDINAL, Explain : Numerals that do not fall under another type
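
In the output above, only `123` is picked up from `123-45-6789`. Strings with a fixed format like this are often easier to catch with rules than with a statistical model. The following is a minimal regex-based sketch. The label `ID_LIKE`, the patterns and the helper `rule_based_entities` are assumptions made for this example and are not part of spaCy.

```python
import re

# Hypothetical patterns for entities with a predictable surface form.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?(?:\s(?:thousand|million|billion))?"),
    "ID_LIKE": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def rule_based_entities(text):
    """Return (matched text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    found.sort()  # order by position in the text
    return [(span, label) for _, span, label in found]

sentence = ("Apple to build a factory in Hong Kong and Mumbai with an initial "
            "investment of $100 million in collaboration with Microsoft and "
            "Simplilearn. Lets try 123-45-6789")
entities = rule_based_entities(sentence)
print(entities)  # [('$100 million', 'MONEY'), ('123-45-6789', 'ID_LIKE')]
```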


In [17]:
from spacy import displacy
displacy.render(data, jupyter=True,style="ent")

In [18]:
model.get_pipe("ner").labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

In [19]:
# Download the resources needed for NER in NLTK
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/oysterable/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/oysterable/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to
[nltk_data]     /Users/oysterable/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [20]:
#NER in NLTK

from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk import ne_chunk

sentence = u"Apple to build a factory in Hong Kong and Mumbai with an initial investment of $100 million in collaboration with Microsoft and Simplilearn. Lets try 123-45-6789"

words = word_tokenize(sentence)

posTag = pos_tag(words)

nerData = ne_chunk(posTag)

print(nerData)

(S
  (GPE Apple/NNP)
  to/TO
  build/VB
  a/DT
  factory/NN
  in/IN
  (GPE Hong/NNP Kong/NNP)
  and/CC
  (PERSON Mumbai/NNP)
  with/IN
  an/DT
  initial/JJ
  investment/NN
  of/IN
  $/$
  100/CD
  million/CD
  in/IN
  collaboration/NN
  with/IN
  (PERSON Microsoft/NNP)
  and/CC
  (GPE Simplilearn/NNP)
  ./.
  Lets/NNS
  try/VBP
  123-45-6789/JJ)
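
The printed tree mixes plain `word/TAG` tokens with labelled chunks such as `(GPE Hong/NNP Kong/NNP)`. The sketch below pulls (entity, label) pairs out of a shortened copy of that printed text using a regular expression. It is an illustration that works on the text only; with the live `nerData` object one would more typically walk the tree and read each subtree's label and leaves.

```python
import re

# Shortened excerpt of the printed tree above (non-entity tokens trimmed).
tree_text = """(S
  (GPE Apple/NNP)
  to/TO
  build/VB
  (GPE Hong/NNP Kong/NNP)
  and/CC
  (PERSON Mumbai/NNP)
  (PERSON Microsoft/NNP)
  (GPE Simplilearn/NNP)
  ./.)"""

def chunks_from_text(text):
    """Extract (entity text, label) from '(LABEL word/TAG ...)' groups."""
    pairs = []
    for label, body in re.findall(r"\((\w+) ([^()]+)\)", text):
        # Drop the '/TAG' suffix from every token inside the chunk.
        words = [tok.rsplit("/", 1)[0] for tok in body.split()]
        pairs.append((" ".join(words), label))
    return pairs

chunks = chunks_from_text(tree_text)
print(chunks)
```

Listing the pairs this way makes the differences from the spaCy result easy to spot: here Apple and Simplilearn come out as GPE, and Mumbai and Microsoft as PERSON.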


# Public datasets:

www.kaggle.com/datasets

cloud.google.com/bigquery/public-data

Others:
    https://cloud.google.com/healthcare-api/docs/resources/public-datasets/nih-chest