## Language, spaCy

In [1]:
import spacy

In [3]:
nlp = spacy.load('en_core_web_md')

#### Natural Language Processing
* spaCy relies on an underlying statistical model of the English language
* the models are not designed for poetic text: the `web` models are statistically trained on text from sources like Wikipedia or Google News
* English is the most researched language, so it is hard to find tools of the same quality for other languages
* NLProc: a common shorthand for natural language processing

#### English grammar
* sentences, words and parts of speech
* part of speech is contextual, not a property of the word itself
* NLP models have a statistical understanding of this context
* dependency grammar
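
To make "part of speech is contextual" concrete, here is a minimal sketch in plain Python. The function `guess_pos` and its word lists are hypothetical and hand-written for illustration. spaCy's models learn such cues statistically and do not use rules like these.

```python
# Toy illustration only: real models learn context statistically from
# annotated text rather than from hand-written rules like these.
MODALS = {"should", "can", "will", "must"}
DETERMINERS = {"a", "an", "the", "this"}

def guess_pos(previous_word, word):
    """Guess a coarse part of speech for `word` from the word before it."""
    prev = previous_word.lower()
    if prev in MODALS or prev == "to":
        return (word, "VERB")   # "should act", "to act"
    if prev in DETERMINERS:
        return (word, "NOUN")   # "an act", "the act"
    return (word, "UNKNOWN")

print(guess_pos("should", "act"))  # ('act', 'VERB')
print(guess_pos("an", "act"))      # ('act', 'NOUN')
```

The same string, "act", receives a different label depending on its neighbor, which is why the tag cannot be stored as a property of the word alone.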

In [4]:
doc = nlp("All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status. Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.")

Get a list of all sentences.

`doc.sents` is an iterator, so wrap it in `list()` to see every sentence at once.

In [10]:
list(doc.sents)

[All human beings are born free and equal in dignity and rights.,
 They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.,
 Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status.,
 Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.]

In [11]:
for item in doc.sents:
    print(item.text)

All human beings are born free and equal in dignity and rights.
They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status.
Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.


In [12]:
sentences = [item.text for item in doc.sents]

In [13]:
import random
random.choice(sentences)

'They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.'
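
Because `doc.sents` produces sentences lazily, it cannot be indexed or measured with `len()` directly. The sketch below imitates that behavior with a hypothetical `ToyDoc` class (a plain-Python stand-in, not spaCy's `Doc`) that splits naively on `". "`.

```python
class ToyDoc:
    """Hypothetical stand-in for a parsed document (not spaCy's Doc class)."""

    def __init__(self, text):
        self.text = text

    @property
    def sents(self):
        # Yield sentences lazily, one at a time, splitting naively on ". ".
        for part in self.text.split(". "):
            yield part if part.endswith(".") else part + "."

toy = ToyDoc("All human beings are born free. They are endowed with reason.")

try:
    toy.sents[1]              # generators do not support indexing
except TypeError as err:
    print("cannot index:", err)

sentences = list(toy.sents)   # materialize once, then index or count freely
print(len(sentences), sentences[1])
```

This is the reason a later cell writes `list(doc.sents)[1]` rather than `doc.sents[1]`.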

In [16]:
words = [item.text for item in doc]

These are technically 'tokens': generalized pieces of a document, e.g. words and punctuation.
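
A rough way to picture tokens is a regular expression that separates runs of word characters from individual punctuation marks. This is only a sketch; `toy_tokenize` is a hypothetical helper and is far simpler than spaCy's tokenizer, which handles contractions, abbreviations, and other special cases.

```python
import re

def toy_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Everyone is entitled, without distinction.")
print(tokens)
# ['Everyone', 'is', 'entitled', ',', 'without', 'distinction', '.']
```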

In [19]:
for item in doc:
    print(item.text, item.lemma_)

All all
human human
beings being
are be
born bear
free free
and and
equal equal
in in
dignity dignity
and and
rights right
. .
They -PRON-
are be
endowed endow
with with
reason reason
and and
conscience conscience
and and
should should
act act
towards towards
one one
another another
in in
a a
spirit spirit
of of
brotherhood brotherhood
. .
Everyone everyone
is be
entitled entitle
to to
all all
the the
rights right
and and
freedoms freedom
set set
forth forth
in in
this this
Declaration declaration
, ,
without without
distinction distinction
of of
any any
kind kind
, ,
such such
as as
race race
, ,
colour colour
, ,
sex sex
, ,
language language
, ,
religion religion
, ,
political political
or or
other other
opinion opinion
, ,
national national
or or
social social
origin origin
, ,
property property
, ,
birth birth
or or
other other
status status
. .
Furthermore furthermore
, ,
no no
distinction distinction
shall shall
be be
made make
on on
the the
basis basis
of of
the the
political p
[output truncated]

Lemmatization: reducing a word to its base form (lemma) by removing all inflections

* cats -> cat
* running -> run
* are -> be
* geese -> goose

derivation and morphology
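
The four lemmatization examples above can be reproduced with a lookup table for irregular forms plus two suffix rules. This is a minimal sketch under stated assumptions: `toy_lemma` and `IRREGULAR` are hypothetical names, and the rules are deliberately crude (they would wrongly turn "status" into "statu"). spaCy's lemmatizer is considerably more careful.

```python
# Irregular forms cannot be derived by stripping a suffix, so look them up.
IRREGULAR = {"are": "be", "is": "be", "geese": "goose", "made": "make"}

def toy_lemma(word):
    """Return a rough lemma for `word` using a lookup and two suffix rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ing") and len(w) > 5:
        stem = w[:-3]
        # Undo consonant doubling: "running" -> "runn" -> "run".
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]          # plain plural: "cats" -> "cat"
    return w

for word in ["cats", "running", "are", "geese"]:
    print(word, "->", toy_lemma(word))
```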

In [20]:
for item in doc:
    print(item.text, item.pos_, item.tag_)

All DET DT
human ADJ JJ
beings NOUN NNS
are VERB VBP
born VERB VBN
free ADJ JJ
and CCONJ CC
equal ADJ JJ
in ADP IN
dignity NOUN NN
and CCONJ CC
rights NOUN NNS
. PUNCT .
They PRON PRP
are VERB VBP
endowed VERB VBN
with ADP IN
reason NOUN NN
and CCONJ CC
conscience NOUN NN
and CCONJ CC
should VERB MD
act VERB VB
towards ADP IN
one NUM CD
another DET DT
in ADP IN
a DET DT
spirit NOUN NN
of ADP IN
brotherhood NOUN NN
. PUNCT .
Everyone NOUN NN
is VERB VBZ
entitled VERB VBN
to ADP IN
all ADJ PDT
the DET DT
rights NOUN NNS
and CCONJ CC
freedoms NOUN NNS
set VERB VBN
forth ADV RB
in ADP IN
this DET DT
Declaration PROPN NNP
, PUNCT ,
without ADP IN
distinction NOUN NN
of ADP IN
any DET DT
kind NOUN NN
, PUNCT ,
such ADJ JJ
as ADP IN
race NOUN NN
, PUNCT ,
colour NOUN NN
, PUNCT ,
sex NOUN NN
, PUNCT ,
language NOUN NN
, PUNCT ,
religion NOUN NN
, PUNCT ,
political ADJ JJ
or CCONJ CC
other ADJ JJ
opinion NOUN NN
, PUNCT ,
national ADJ JJ
or CCONJ CC
social ADJ JJ
origin NOUN NN
, PUNCT ,
prope
[output truncated]

In [21]:
[item.text for item in doc if item.tag_ == "NNS"]

['beings', 'rights', 'rights', 'freedoms']

In [22]:
[item.text for item in doc if item.pos_ == "NOUN"]

['beings',
 'dignity',
 'rights',
 'reason',
 'conscience',
 'spirit',
 'brotherhood',
 'Everyone',
 'rights',
 'freedoms',
 'distinction',
 'kind',
 'race',
 'colour',
 'sex',
 'language',
 'religion',
 'opinion',
 'origin',
 'property',
 'birth',
 'status',
 'distinction',
 'basis',
 'status',
 'country',
 'territory',
 'person',
 'trust',
 'self',
 'governing',
 'limitation',
 'sovereignty']
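
The two filters above differ in granularity: `pos_` is a coarse part-of-speech category, while `tag_` is a fine-grained tag (for example `NN` for a singular noun and `NNS` for a plural noun). The sketch below groups fine-grained tags under their coarse category, using triples copied by hand from the first sentence of the tagging output above. No spaCy call is made here.

```python
from collections import defaultdict

# (text, pos_, tag_) triples copied from the first sentence of the output above.
tagged = [
    ("All", "DET", "DT"), ("human", "ADJ", "JJ"), ("beings", "NOUN", "NNS"),
    ("are", "VERB", "VBP"), ("born", "VERB", "VBN"), ("free", "ADJ", "JJ"),
    ("and", "CCONJ", "CC"), ("equal", "ADJ", "JJ"), ("in", "ADP", "IN"),
    ("dignity", "NOUN", "NN"), ("and", "CCONJ", "CC"), ("rights", "NOUN", "NNS"),
    (".", "PUNCT", "."),
]

# Collect every fine-grained tag seen under each coarse category.
tags_by_pos = defaultdict(set)
for text, pos, tag in tagged:
    tags_by_pos[pos].add(tag)

for pos in sorted(tags_by_pos):
    print(pos, sorted(tags_by_pos[pos]))
```

Filtering on `tag_ == "NNS"` therefore returns a subset of what `pos_ == "NOUN"` returns.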

In [24]:
nouns = [item.text for item in doc if item.pos_ == 'NOUN']
adjectives = [item.text for item in doc if item.pos_ == 'ADJ']

In [25]:
for i in range(10):
    print(random.choice(adjectives) + " " + random.choice(nouns))

human spirit
such Everyone
political status
national conscience
all freedoms
political language
jurisdictional spirit
free self
other brotherhood
such property


In [26]:
[item.text for item in doc.noun_chunks]

['All human beings',
 'dignity',
 'rights',
 'They',
 'reason',
 'conscience',
 'a spirit',
 'brotherhood',
 'Everyone',
 'all the rights',
 'freedoms',
 'this Declaration',
 'distinction',
 'any kind',
 'race',
 'colour',
 'sex',
 'language',
 'religion',
 'opinion',
 'national or social origin',
 'property',
 'birth',
 'other status',
 'no distinction',
 'the basis',
 'the political, jurisdictional or international status',
 'the country',
 'territory',
 'a person',
 'it',
 'any other limitation',
 'sovereignty']

In [28]:
my_sentence = list(doc.sents)[1]

In [33]:
for item in my_sentence:
    print(item.text, item.tag_, item.head.text, item.dep_, list(item.children), list(item.subtree))

They PRP endowed nsubjpass [] [They]
are VBP endowed auxpass [] [are]
endowed VBN endowed ROOT [They, are, with, and, act, .] [They, are, endowed, with, reason, and, conscience, and, should, act, towards, one, another, in, a, spirit, of, brotherhood, .]
with IN endowed prep [reason] [with, reason, and, conscience]
reason NN with pobj [and, conscience] [reason, and, conscience]
and CC reason cc [] [and]
conscience NN reason conj [] [conscience]
and CC endowed cc [] [and]
should MD act aux [] [should]
act VB endowed conj [should, towards, in] [should, act, towards, one, another, in, a, spirit, of, brotherhood]
towards IN act prep [one] [towards, one, another]
one CD towards pobj [another] [one, another]
another DT one det [] [another]
in IN act prep [spirit] [in, a, spirit, of, brotherhood]
a DT spirit det [] [a]
spirit NN in pobj [a, of] [a, spirit, of, brotherhood]
of IN spirit prep [brotherhood] [of, brotherhood]
brotherhood NN of pobj [] [brotherhood]
. . endowed punct [] [.]


In [32]:
def flatten_subtree(st):
    # Join each token's text plus its trailing whitespace, then trim the ends.
    return ''.join(w.text_with_ws for w in st).strip()

In [35]:
for item in my_sentence:
    print(item.text, "/", flatten_subtree(item.subtree))

They / They
are / are
endowed / They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
with / with reason and conscience
reason / reason and conscience
and / and
conscience / conscience
and / and
should / should
act / should act towards one another in a spirit of brotherhood
towards / towards one another
one / one another
another / another
in / in a spirit of brotherhood
a / a
spirit / a spirit of brotherhood
of / of brotherhood
brotherhood / brotherhood
. / .
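
The `head`, `children`, and `subtree` attributes used above all follow from one piece of information: which word each word attaches to. The sketch below rebuilds them in plain Python for the fragment "in a spirit of brotherhood", with head indices copied from the parse output above. The helper names are hypothetical, and "in" is treated as the root of the fragment, whereas in the full sentence its head is "act".

```python
words = ["in", "a", "spirit", "of", "brotherhood"]
# Index of each word's head; "in" points at itself as the root of this fragment.
heads = [0, 2, 0, 2, 3]

def children(i):
    """Indices of the words whose head is word i."""
    return [j for j, h in enumerate(heads) if h == i and j != i]

def subtree(i):
    """Indices of word i and all of its descendants, in sentence order."""
    found = [i]
    for child in children(i):
        found.extend(subtree(child))
    return sorted(found)

def flatten(i):
    """Join the words of the subtree rooted at word i into one string."""
    return " ".join(words[j] for j in subtree(i))

for i, word in enumerate(words):
    print(word, "/", flatten(i))
```

The printed lines match the corresponding rows of the output above, for example `spirit / a spirit of brotherhood`.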


Flattening subtrees lets you get all the prepositional phrases: pick every token whose dependency label is `prep` and print its subtree.

In [37]:
for word in doc:
    if word.dep_ == 'prep':
        print(flatten_subtree(word.subtree))

in dignity and rights
with reason and conscience
towards one another
in a spirit of brotherhood
of brotherhood
to all the rights and freedoms set forth in this Declaration
in this Declaration
without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status
of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status
such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status
on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs
of the political, jurisdictional or international status of the country or territory to which a person belongs
of the country or territory to which a person belongs
to which
of sovereignty


In [38]:
for word in doc:
    if word.dep_ == 'nsubj':
        print(flatten_subtree(word.subtree))

a person
it


In [40]:
for word in doc:
    # Include passive subjects (nsubjpass) as well as active ones (nsubj).
    if word.dep_ in ('nsubj', 'nsubjpass'):
        print(flatten_subtree(word.subtree))

All human beings
They
Everyone
no distinction
a person
it


In [41]:
list(doc.ents)

[]

### parsing from a file

In [43]:
# Read the whole file, closing it afterwards, then parse the text.
with open("../3Week/plaintext-example-files/genesis.txt") as f:
    doc2 = nlp(f.read())

In [44]:
[item.text for item in doc2.ents if item.label_ == "PERSON"]

['God', 'God', 'Night', 'Behold']

In [45]:
for item in doc2.ents:
    print(item.text, item.label_)

earth LOC
earth LOC
God PERSON
God PERSON
Night PERSON
the evening and the morning TIME
the first day DATE
the second day DATE
one CARDINAL
Earth LOC
earth LOC
earth LOC
the third day DATE
the day DATE
seasons DATE
days DATE
years DATE
earth LOC
two CARDINAL
the day DATE
the night TIME
the day DATE
the night TIME
the evening and the morning TIME
the fourth day DATE
the fifth day DATE
Behold PERSON
earth LOC
earth LOC
earth LOC
the sixth day DATE
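
A quick way to review entity output like this is to count the labels and group the texts under each one. The sketch below does that in plain Python on a few `(text, label)` pairs copied by hand from the output above; with a real document the pairs would come from `doc2.ents`. Grouping makes questionable labels easy to spot, such as "Night" and "Behold" tagged as `PERSON`, which is plausibly a consequence of the model not being trained on text of this style.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from a few lines of the output above.
entities = [
    ("earth", "LOC"),
    ("God", "PERSON"),
    ("Night", "PERSON"),
    ("the first day", "DATE"),
    ("the second day", "DATE"),
    ("one", "CARDINAL"),
    ("the night", "TIME"),
    ("Behold", "PERSON"),
]

# How often does each label occur?
label_counts = Counter(label for _, label in entities)

# Which texts were assigned each label?
texts_by_label = defaultdict(list)
for text, label in entities:
    texts_by_label[label].append(text)

print(label_counts.most_common())
print(texts_by_label["PERSON"])
```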
