# NLP using spaCy

Notes and semi-structured snippets on using the spaCy library for basic NLP operations such as tokenization, POS tagging, and NER.

Further reading:

https://spacy.io/docs/usage/tutorials
https://github.com/cytora/pycon-nlp-in-10-lines/blob/master/01_pride_and_predjudice.ipynb
https://github.com/JonathanReeve/advanced-text-analysis-workshop-2017/blob/master/advanced-text-analysis.ipynb

In [24]:
import pandas as pd
# spaCy 1.x import path; spaCy 2 and later load a model with spacy.load() instead
from spacy.en import English
nlp = English()

In [25]:
text = ("This is a bunch of text. It contains a few sentences, all together. Your basic, run-of-the-mill paragraph.")

In [36]:
doc = nlp(text)

In [27]:
for sentence in doc.sents:
    print(sentence)

This is a bunch of text.
It contains a few sentences, all together.
Your basic, run-of-the-mill paragraph.


In [28]:
# POS tagging

for token in doc:
    print('{} - {}'.format(token, token.pos_))

This - DET
is - VERB
a - DET
bunch - NOUN
of - ADP
text - NOUN
. - PUNCT
It - PRON
contains - VERB
a - DET
few - ADJ
sentences - NOUN
, - PUNCT
all - DET
together - ADV
. - PUNCT
Your - ADJ
basic - ADJ
, - PUNCT
run - VERB
- - PUNCT
of - ADP
- - PUNCT
the - DET
- - PUNCT
mill - NOUN
paragraph - NOUN
. - PUNCT
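
A per-token listing like the one above is often easier to read as a tally. The following is a minimal sketch, assuming only the (token, tag) pairs printed for the first sentence; the names `tagged` and `pos_counts` are illustrative and not part of spaCy.

```python
from collections import Counter

# (token, tag) pairs copied from the first sentence of the output above.
tagged = [
    ("This", "DET"), ("is", "VERB"), ("a", "DET"), ("bunch", "NOUN"),
    ("of", "ADP"), ("text", "NOUN"), (".", "PUNCT"),
]

# With a real spaCy doc this would be Counter(token.pos_ for token in doc).
pos_counts = Counter(tag for _, tag in tagged)

# Print the tags from most to least frequent.
for tag, count in pos_counts.most_common():
    print('{} - {}'.format(tag, count))
```

Hard-coding the pairs keeps the sketch runnable without a spaCy install; the commented one-liner is the form that would be used inside the notebook.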


In [78]:
# NER
# Note: an entity is not detected if it does not start with an uppercase letter
doc_2 = nlp(u"I went to Paris where I met my old friend Jack from my university, who now works for Microsoft. He recently started learning French")
for ent in doc_2.ents:
    print('{} - {}'.format(ent, ent.label_))

Paris - GPE
Jack - PERSON
Microsoft - ORG
French - NORP
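
The entity listing above can also be grouped by label. This is a rough sketch rather than a spaCy feature: `Entity` is a hypothetical stand-in that models only the `.text` and `.label_` attributes, and `group_by_label` is an illustrative helper name.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a spaCy entity span: only the two attributes
# used below (.text and .label_) are modelled.
Entity = namedtuple("Entity", ["text", "label_"])


def group_by_label(ents):
    """Map each entity label to the list of entity texts carrying it."""
    groups = defaultdict(list)
    for ent in ents:
        groups[ent.label_].append(ent.text)
    return dict(groups)


# Entities copied from the output above; with spaCy, pass doc_2.ents instead.
sample = [
    Entity("Paris", "GPE"), Entity("Jack", "PERSON"),
    Entity("Microsoft", "ORG"), Entity("French", "NORP"),
]
print(group_by_label(sample))
```

Because the helper only reads two attributes, it should accept real entity spans as well as the stand-ins used here.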


In [30]:
# Noun chunking

doc_3 = nlp(u"The boy saw the yellow dog")
for chunk in doc_3.noun_chunks:
    print(chunk)

The boy
the yellow dog


In [31]:
# Pair each noun chunk with the head of its root token
nounphrases = [[chunk.orth_, chunk.root.head.orth_] for chunk in doc.noun_chunks]

In [38]:

sentences = [sentence.orth_ for sentence in doc.sents]
print("There were {} sentences found. Here's a sample:".format(len(sentences)))
pd.DataFrame(sentences[1:3])

There were 3 sentences found. Here's a sample:


                                            0
0  It contains a few sentences, all together.
1      Your basic, run-of-the-mill paragraph.


In [40]:
# Pair each noun chunk with the head of its root token
nounphrases = [[chunk.orth_, chunk.root.head.orth_] for chunk in doc.noun_chunks]
print("There were {} noun phrases found. Here's a sample:".format(len(nounphrases)))
pd.DataFrame(nounphrases[1:4])

There were 5 noun phrases found. Here's a sample:


                 0         1
0             text        of
1               It  contains
2  a few sentences  contains


In [44]:
entities = list(doc_2.ents)
print("There were {} entities found".format(len(entities)))

There were 4 entities found


In [45]:
orgs_and_people = [entity.orth_ for entity in entities if entity.label_ in ['ORG','PERSON']]
pd.DataFrame(orgs_and_people)

           0
0       Jack
1  Microsoft


In [46]:
# token.lemma is the integer ID of the lemma; token.lemma_ is its string form
for token in doc_3:
    print(token, token.lemma, token.lemma_)

The 501 the
boy 2729 boy
saw 678 see
the 501 the
yellow 6056 yellow
dog 1847 dog
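
The integer and string columns above are two views of the same lemma. In spaCy the mapping is held in the vocabulary's string store (`nlp.vocab.strings`); the sketch below does not call it and instead rebuilds a small lookup table from the printed rows, so `rows`, `id_to_lemma` and `same_lemma` are illustrative names only.

```python
# (token, lemma ID, lemma string) rows copied from the output above.
# The integers come from the spaCy version used there; other versions differ.
rows = [
    ("The", 501, "the"), ("boy", 2729, "boy"), ("saw", 678, "see"),
    ("the", 501, "the"), ("yellow", 6056, "yellow"), ("dog", 1847, "dog"),
]

# Build the ID -> string table implied by those rows.
id_to_lemma = {lemma_id: lemma for _, lemma_id, lemma in rows}

# Two surface forms share a lemma exactly when their IDs are equal,
# so the comparison never has to touch the strings.
same_lemma = rows[0][1] == rows[3][1]
print(same_lemma, id_to_lemma[rows[0][1]])
```

Here "The" and "the" carry the same ID, which is why they compare equal without any lowercasing.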


In [47]:
# token.pos is the integer ID of the coarse POS tag; token.pos_ is its string form
for token in doc_3:
    print(token, token.pos, token.pos_)

The 88 DET
boy 90 NOUN
saw 98 VERB
the 88 DET
yellow 82 ADJ
dog 90 NOUN
