# Using off-the-shelf NLP tools with spaCy

This notebook shows how to use off-the-shelf NLP models for tasks such as key phrase extraction, named entity recognition, and part-of-speech tagging.


Based on: ["Applied Language Technology MOOC"](https://applied-language-technology.mooc.fi/html/notebooks/part_ii/03_basic_nlp.html)

Let us start by installing two useful text processing libraries: spaCy and textacy. Textacy is built on top of spaCy and adds a few more NLP functions.

In [1]:
!pip install spacy
!pip install textacy



# spaCy

I will start with spaCy. To do anything with spaCy, you first have to download a language model that has already been trained to perform several tasks. Here I download an English model. The full list of models for all supported languages is at: https://spacy.io/models
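
The transformer model is a large download (the `en_core_web_trf` 3.4.0 wheel is about 460 MB), so it can be worth checking whether it is already installed before downloading it again. spaCy models are installed as regular Python packages, so a minimal sketch of such a check could look like the following. The helper names `model_is_installed` and `ensure_model` are made up for this illustration and are not part of spaCy.

```python
import importlib.util


def model_is_installed(name):
    """Return True if a top-level Python package called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def ensure_model(name="en_core_web_trf"):
    """Download the spaCy model only when it is not installed yet."""
    if not model_is_installed(name):
        import spacy.cli  # imported lazily, only when a download is needed
        spacy.cli.download(name)
```

Calling `ensure_model()` once at the top of the notebook would then skip the download on later runs.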

In [2]:
#Download the required Spacy model. 
import spacy.cli
spacy.cli.download("en_core_web_trf")


In [3]:
#Now, load the spacy model. 
import spacy
nlp = spacy.load('en_core_web_trf')

To use spaCy's functions, we first have to convert a string into spaCy's document format (a `Doc` object). The `nlp` object that we loaded in the previous cell does that for us.

In [4]:
text = "Ludwig Maximilian University of Munich (also referred to as LMU or simply as the University of Munich; German: Ludwig-Maximilians-Universität München) is a public research university located in Munich, Germany, and is the country's sixth-oldest university in continuous operation."
doc = nlp(text)



In [5]:
#Now, let us look at Spacy's tokenization for this text:
for token in doc:
    print(token)

Ludwig
Maximilian
University
of
Munich
(
also
referred
to
as
LMU
or
simply
as
the
University
of
Munich
;
German
:
Ludwig
-
Maximilians
-
Universität
München
)
is
a
public
research
university
located
in
Munich
,
Germany
,
and
is
the
country
's
sixth
-
oldest
university
in
continuous
operation
.
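
To see what the tokenizer adds, it can help to compare the output above with a naive whitespace split. The sketch below uses only plain Python on the same sentence. It is just a point of comparison, not a description of how spaCy tokenizes internally.

```python
text = (
    "Ludwig Maximilian University of Munich (also referred to as LMU or simply "
    "as the University of Munich; German: Ludwig-Maximilians-Universität München) "
    "is a public research university located in Munich, Germany, and is the "
    "country's sixth-oldest university in continuous operation."
)

# Split on whitespace only; punctuation stays glued to the neighbouring word
naive_tokens = text.split()

print(len(naive_tokens))
print(naive_tokens[5], naive_tokens[16], naive_tokens[-1])
```

In the naive version, pieces such as `(also`, `Munich;` and `operation.` keep their punctuation attached, whereas spaCy separates brackets, semicolons, hyphens and the possessive `'s` into tokens of their own.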


In [6]:
#Getting the part of speech tags for individual tokens
for token in doc:
    # Print the token, its coarse-grained POS tag (pos_), and its fine-grained tag (tag_)
    print(token, token.pos_, token.tag_)

Ludwig PROPN NNP
Maximilian PROPN NNP
University PROPN NNP
of ADP IN
Munich PROPN NNP
( PUNCT -LRB-
also ADV RB
referred VERB VBN
to ADP IN
as ADP IN
LMU PROPN NNP
or CCONJ CC
simply ADV RB
as ADP IN
the DET DT
University PROPN NNP
of ADP IN
Munich PROPN NNP
; PUNCT :
German ADJ JJ
: PUNCT :
Ludwig PROPN NNP
- PUNCT HYPH
Maximilians PROPN NNP
- PUNCT HYPH
Universität PROPN NNP
München PROPN NNP
) PUNCT -RRB-
is AUX VBZ
a DET DT
public ADJ JJ
research NOUN NN
university NOUN NN
located VERB VBN
in ADP IN
Munich PROPN NNP
, PUNCT ,
Germany PROPN NNP
, PUNCT ,
and CCONJ CC
is AUX VBZ
the DET DT
country NOUN NN
's PART POS
sixth ADV RB
- PUNCT HYPH
oldest ADJ JJS
university NOUN NN
in ADP IN
continuous ADJ JJ
operation NOUN NN
. PUNCT .
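
A quick way to summarize such a listing is to count how often each coarse-grained tag occurs. The sketch below is plain Python and, as an assumption for illustration, works on a few `(token, pos)` pairs copied from the output above rather than on a live `doc`. With a real spaCy `Doc`, the same `Counter` could be fed with the `token.pos_` values.

```python
from collections import Counter

# (token, coarse POS) pairs copied from the output above for the clause
# "is a public research university located in Munich"
tagged = [
    ("is", "AUX"), ("a", "DET"), ("public", "ADJ"), ("research", "NOUN"),
    ("university", "NOUN"), ("located", "VERB"), ("in", "ADP"), ("Munich", "PROPN"),
]

# Count how many tokens carry each tag
pos_counts = Counter(pos for _, pos in tagged)

for pos, count in pos_counts.most_common():
    print(pos, count)
```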


In [7]:
# Print the token and the results of morphological analysis
for token in doc:
    print(token, token.morph)

Ludwig Number=Sing
Maximilian Number=Sing
University Number=Sing
of 
Munich Number=Sing
( PunctSide=Ini|PunctType=Brck
also 
referred Aspect=Perf|Tense=Past|VerbForm=Part
to 
as 
LMU Number=Sing
or ConjType=Cmp
simply 
as 
the Definite=Def|PronType=Art
University Number=Sing
of 
Munich Number=Sing
; 
German Degree=Pos
: 
Ludwig Number=Sing
- PunctType=Dash
Maximilians Number=Sing
- PunctType=Dash
Universität Number=Sing
München Number=Sing
) PunctSide=Fin|PunctType=Brck
is Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
a Definite=Ind|PronType=Art
public Degree=Pos
research Number=Sing
university Number=Sing
located Aspect=Perf|Tense=Past|VerbForm=Part
in 
Munich Number=Sing
, PunctType=Comm
Germany Number=Sing
, PunctType=Comm
and ConjType=Cmp
is Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
the Definite=Def|PronType=Art
country Number=Sing
's 
sixth 
- PunctType=Dash
oldest Degree=Sup
university Number=Sing
in 
continuous Degree=Pos
operation Number=Sing
. PunctType=Per

In [11]:
#Get the per token morphological information
#doc[7] is the word "referred" in our text
print(doc[7])
print(doc[7].morph.to_dict())

referred
{'Aspect': 'Perf', 'Tense': 'Past', 'VerbForm': 'Part'}
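
The morphological strings printed earlier are `Feature=Value` pairs joined by `|`, and `to_dict()` returns the same information as a dictionary. If you only have the printed strings (for example, saved to a file), a small parser along these lines recovers the dictionary form. `parse_feats` is a hypothetical helper written for this sketch, not a spaCy function.

```python
def parse_feats(feats):
    """Split a 'Feature=Value|Feature=Value' string into a dict."""
    if not feats:
        # Tokens without morphological features print as an empty string
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


print(parse_feats("Aspect=Perf|Tense=Past|VerbForm=Part"))
```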


In [12]:
#View the syntactic parse tree of the sentence to see relations between words
from spacy import displacy
displacy.render(doc, style='dep', options={'compact': True})


In [14]:
# Loop over sentences in the Doc object and count them using enumerate()
# We have only one sentence in our doc, though. 
for number, sent in enumerate(doc.sents):    
    print(number, sent)

0 Ludwig Maximilian University of Munich (also referred to as LMU or simply as the University of Munich; German: Ludwig-Maximilians-Universität München) is a public research university located in Munich, Germany, and is the country's sixth-oldest university in continuous operation.


In [13]:
# Print the token and its lemma
for token in doc:
    print(token, token.lemma_)

Ludwig Ludwig
Maximilian Maximilian
University University
of of
Munich Munich
( (
also also
referred refer
to to
as as
LMU LMU
or or
simply simply
as as
the the
University University
of of
Munich Munich
; ;
German german
: :
Ludwig Ludwig
- -
Maximilians Maximilians
- -
Universität Universität
München München
) )
is be
a a
public public
research research
university university
located locate
in in
Munich Munich
, ,
Germany Germany
, ,
and and
is be
the the
country country
's 's
sixth sixth
- -
oldest old
university university
in in
continuous continuous
operation operation
. .
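
Most tokens in this sentence are their own lemma, so it can be easier to look only at the tokens that actually change. The sketch below filters a few `(token, lemma)` pairs copied from the output above and ignores differences in capitalization. With a real `Doc`, the pairs could come from `token.text` and `token.lemma_`.

```python
# (token, lemma) pairs copied from the output above
pairs = [
    ("referred", "refer"), ("as", "as"), ("LMU", "LMU"), ("German", "german"),
    ("is", "be"), ("located", "locate"), ("oldest", "old"), ("university", "university"),
]

# Keep only pairs where the lemma differs from the token, ignoring case
changed = [(token, lemma) for token, lemma in pairs if token.lower() != lemma.lower()]

for token, lemma in changed:
    print(token, "->", lemma)
```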


In [15]:
# Loop over the named entities in the Doc object 
for ent in doc.ents:
    # Print the named entity and its label
    print(ent.text, ent.label_)

Ludwig Maximilian University of Munich ORG
LMU ORG
the University of Munich ORG
German NORP
Ludwig-Maximilians-Universität München ORG
Munich GPE
Germany GPE
sixth ORDINAL
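
For longer texts it is often handier to group the entities by label than to read them in document order. A minimal sketch of that grouping, using the `(text, label)` pairs from the output above as stand-in data (with a real `Doc` they would come from `ent.text` and `ent.label_`):

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above
entities = [
    ("Ludwig Maximilian University of Munich", "ORG"),
    ("LMU", "ORG"),
    ("the University of Munich", "ORG"),
    ("German", "NORP"),
    ("Ludwig-Maximilians-Universität München", "ORG"),
    ("Munich", "GPE"),
    ("Germany", "GPE"),
    ("sixth", "ORDINAL"),
]

# Collect entity texts under their label, keeping document order within each label
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```

If a label such as `NORP` or `GPE` is unfamiliar, `spacy.explain()` can be used to look up a short description of it.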


In [16]:
displacy.render(doc, style='ent')


In [17]:
# Get the noun chunks in the doc.
for item in doc.noun_chunks:
    print(item)

Ludwig Maximilian University
Munich
LMU
the University
Munich
German: Ludwig-Maximilians-Universität München
a public research university
Munich
Germany
the country's sixth-oldest university
continuous operation


# Textacy

Reference: https://textacy.readthedocs.io/en/latest/quickstart.html


In [18]:
#import it!
import textacy

In [27]:
text2 = """
Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.[2]

Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks, convolutional neural networks and Transformers have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance
"""

In [28]:
text2

'\nDeep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.[2]\n\nDeep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks, convolutional neural networks and Transformers have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance\n'

In [29]:
# We still have to convert the text to a spaCy Doc, even when using textacy,
# so we load a spaCy model first (here with the parser disabled) and use it for the conversion.
en = textacy.load_spacy_lang("en_core_web_trf", disable=("parser",))
tdoc = textacy.make_spacy_doc(text2, en)

In [30]:
list(textacy.extract.ngrams(tdoc, 3, filter_stops=True, filter_punct=True, filter_nums=False))

[known as deep,
 deep structured learning,
 family of machine,
 machine learning methods,
 learning methods based,
 based on artificial,
 artificial neural networks,
 networks with representation,
 supervised or unsupervised.[2,
 deep neural networks,
 deep belief networks,
 deep reinforcement learning,
 recurrent neural networks,
 convolutional neural networks,
 networks and Transformers,
 applied to fields,
 fields including computer,
 including computer vision,
 natural language processing,
 medical image analysis,
 inspection and board,
 board game programs,
 produced results comparable,
 cases surpassing human,
 surpassing human expert,
 human expert performance]
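
The list above contains entries such as "known as deep", where a stop word sits in the middle of the n-gram. As far as I understand textacy's `filter_stops` option, it drops n-grams that begin or end with a stop word, which would explain this. The following pure-Python sketch illustrates that filtering rule on a hand-made token list. The function `ngrams` and the tiny stop list are assumptions for illustration, not textacy's code or spaCy's stop list.

```python
def ngrams(tokens, n, stop_words=frozenset()):
    """Yield n-token windows that neither begin nor end with a stop word."""
    for i in range(len(tokens) - n + 1):
        window = tokens[i : i + n]
        if window[0].lower() in stop_words or window[-1].lower() in stop_words:
            continue  # skip windows with a stop word at either edge
        yield " ".join(window)


tokens = ["also", "known", "as", "deep", "structured", "learning"]
stops = {"also", "as"}  # tiny illustrative stop list

print(list(ngrams(tokens, 3, stops)))
```

On this input the sketch keeps "known as deep" and "deep structured learning", matching the first two entries of the textacy output.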

## Key phrase extraction

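Textacy supports several key phrase extraction algorithms. The cell below uses TextRank and returns the ten highest-scoring key phrases, normalized to their lemmas.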

In [32]:
from textacy.extract import keyterms as kt
kt.textrank(tdoc, normalize="lemma", topn=10)

[('deep structured learning', 0.09440093989778794),
 ('deep reinforcement learning', 0.09440093989778794),
 ('deep learning', 0.08618248103400335),
 ('deep neural network', 0.07591186832866487),
 ('machine learning method', 0.07374265985118073),
 ('deep belief network', 0.06477243188699044),
 ('learning architecture', 0.05794759829539524),
 ('representation learning', 0.05769656963325219),
 ('artificial neural network', 0.0481060162049462),
 ('recurrent neural network', 0.046970691086292024)]
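
TextRank scores terms by running a PageRank-style algorithm over a graph of words that occur near each other. The following is a deliberately simplified, pure-Python sketch of that idea and not textacy's implementation: it scores single words only, links words within a fixed window, and skips candidate filtering and phrase assembly. The name `textrank_words` and its parameters are chosen for this illustration.

```python
from collections import defaultdict


def textrank_words(words, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link each word to the words that follow it inside the window
    # (window=2 links only directly adjacent words)
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration: each word shares its score equally among its neighbours
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


words = "deep learning deep neural network deep belief network".split()
ranked = textrank_words(words)
for word, score in ranked:
    print(word, round(score, 3))
```

In this toy input, "deep" is connected to the most other words and therefore ends up with the highest score, which mirrors why phrases containing "deep" dominate the textacy output above.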

Check textacy's documentation to learn more. Textacy also supports a few other languages, and key phrase extraction can be a very useful function to know about.