# NLP Unit 4

Unit 4 has the following parts:
* Finish Word Embeddings from unit 3: start at slide 30
* Look at important basic NLP tasks, such as POS tagging, NER, parsing, etc., using these slides: https://www.slideshare.net/GirishKhanzode/nlp-52218202
* In combination with the slides, we will also use spaCy for hands-on practice
* **This unit is organized around this ipynb notebook; we switch back and forth between it and the slides**

# spaCy

* spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
* NLTK is geared more towards education and is considerably older.
* spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.


## Resources:
### Resource 1: Web site:  https://spacy.io/
### Resource 2: Basic usage and functions: https://spacy.io/usage/spacy-101
### Resource 3: Course: a very nice course can be found here: https://course.spacy.io/

## Installation:

In [None]:
# if not installed yet
!pip3 install spacy
!python3 -m spacy download en_core_web_md

In [1]:
import spacy
nlp = spacy.load("en_core_web_md")

<font color='blue'>


### Slides: Basic NLP (repetition) 2-10
While the installation is running, let's look at the slides: 2-10</font>
    


In [2]:
# analyze a piece of text with the model
doc = nlp(u'This is a sentence, which has two parts.')
print(doc.text)
print(doc.lang_)
print(list(doc.sents))


This is a sentence, which has two parts.
en
[This is a sentence, which has two parts.]


### Let's just look at the spaCy course: https://course.spacy.io/ Chapter 1

<font color="red">

### First small exercise:
* Create an example sentence
* Create a spaCy `Doc` object from it
* Iterate over the tokens and show, for each one, the part-of-speech tag, whether it is alphabetic (`is_alpha`), etc.



## spaCy simple example (dependency parse, POS, ...)

In [3]:
docX = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in docX:
    print(token.text, token.pos_, token.dep_, list(token.ancestors), token.lemma_)

Apple PROPN nsubj [looking] Apple
is AUX aux [looking] be
looking VERB ROOT [] look
at ADP prep [looking] at
buying VERB pcomp [at, looking] buy
U.K. PROPN compound [startup, buying, at, looking] U.K.
startup NOUN dobj [buying, at, looking] startup
for ADP prep [buying, at, looking] for
$ SYM quantmod [billion, for, buying, at, looking] $
1 NUM compound [billion, for, buying, at, looking] 1
billion NUM pobj [for, buying, at, looking] billion


### spaCy can be used with many languages. In the simple way shown below, a language class can be used without a statistical ML model. For a better Russian model, see below.

<font color='red'> Note: these simple language classes only do tokenization and other simple rule-based operations

In [4]:
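from spacy.lang.ru import Russian
# NOTE: the imported Russian class is never instantiated here; `nlp` is still the
# English pipeline loaded above, which is why every word below is tagged PROPN.
# For a rule-based Russian pipeline, create it explicitly: nlp_ru = Russian()
doc = nlp("Привет Миру! Как твои дела? Сегодня неплохая погода.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_alpha, token.is_punct)

print("")

from spacy.lang.de import German
# Same caveat: this again runs the English pipeline, not German()
doc = nlp("Guten Tag, wie heissen Sie denn?")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_alpha, token.is_punct)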


Привет Привет PROPN True False
Миру Миру PROPN True False
! ! PUNCT False True
Как Как PROPN True False
твои твои PROPN True False
дела дела PROPN True False
? ? PUNCT False True
Сегодня Сегодня PROPN True False
неплохая неплохая PROPN True False
погода погода PROPN True False
. . PUNCT False True

Guten Guten PROPN True False
Tag Tag PROPN True False
, , PUNCT False True
wie wie PROPN True False
heissen heissen PROPN True False
Sie Sie PROPN True False
denn denn PROPN True False
? ? PUNCT False True
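
The following is a minimal sketch of what "without a statistical model" looks like in practice, assuming `spacy.blank` is available (spaCy 2.0 or later). A blank pipeline contains only the rule-based tokenizer and lexical attributes, so `pos_` stays empty instead of being guessed by the English model. German is used because, depending on the spaCy version, a blank Russian pipeline may need the extra `pymorphy2` package. The variable names `nlp_de`, `doc_de` and `tokens_de` are illustrative.

```python
import spacy

# Blank German pipeline: tokenizer and lexical attributes only, no trained components
nlp_de = spacy.blank("de")
doc_de = nlp_de("Guten Tag, wie heissen Sie denn?")

for token in doc_de:
    # pos_ is an empty string because no tagger has run on the text
    print(token.text, repr(token.pos_), token.is_alpha, token.is_punct)

tokens_de = [token.text for token in doc_de]
```

Rule-based attributes such as `is_alpha` and `is_punct` are still filled in, which is usually enough for simple preprocessing.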


<font color='blue'>
    
# Let's look at the next part of the slides: 11-20

## stopped here!!
## stopped at slide 18! -- continue with parsing next time!


## Let's look at some spaCy features in detail

### Lemmatization

In [5]:
# lemmatization example
docX = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in docX:
    print(token.text, token.lemma_)

Apple Apple
is be
looking look
at at
buying buy
U.K. U.K.
startup startup
for for
$ $
1 1
billion billion


### Part of Speech, Stopwords, is_alpha, ...

In [6]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP Xxxxx True False
is be AUX VBZ xx True True
looking look VERB VBG xxxx True False
at at ADP IN xx True True
buying buy VERB VBG xxxx True False
U.K. U.K. PROPN NNP X.X. False False
startup startup NOUN NN xxxx True False
for for ADP IN xxx True True
$ $ SYM $ $ False False
1 1 NUM CD d False False
billion billion NUM CD xxxx True False
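
The last three columns above (`shape_`, `is_alpha`, `is_stop`) are lexical attributes: they depend only on the token text and the language data, not on the trained model. As a small sketch, assuming `spacy.blank("en")` tokenizes this sentence the same way as the loaded model, they can be reproduced without any model download (`nlp_blank` and `lexical` are illustrative names):

```python
import spacy

# Tokenizer-only English pipeline: no tagger, parser or NER
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Apple is looking at buying U.K. startup for $1 billion")

# Map each token text to its (shape, is_alpha, is_stop) triple
lexical = {t.text: (t.shape_, t.is_alpha, t.is_stop) for t in doc_blank}
for text, attrs in lexical.items():
    print(text, *attrs)
```

By contrast, `lemma_`, `pos_` and `tag_` come from trained or rule-based pipeline components and are not available in this blank pipeline.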


In [7]:
print(spacy.explain("PROPN"))
print(spacy.explain("CD"))

proper noun
cardinal number


### chunking and parsing

<font color='blue'>
    
# Let's look at the dependency parsing slideset


In [8]:
print("Simple chunking:")
list(doc.noun_chunks)

Simple chunking:


[Apple, U.K. startup]

In [9]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Apple PROPN nsubj looking
is AUX aux looking
looking VERB ROOT looking
at ADP prep looking
buying VERB pcomp at
U.K. PROPN compound startup
startup NOUN dobj buying
for ADP prep buying
$ SYM quantmod billion
1 NUM compound billion
billion NUM pobj for
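
To see how the head links printed above define the tree, here is a small pure-Python sketch that rebuilds the ancestor chains shown in the earlier `token.ancestors` output. It is not how spaCy implements it; the `heads` dictionary is copied by hand from the output above and relies on every word in this sentence being unique.

```python
# token -> head, copied from the output above (the ROOT points to itself)
heads = {
    "Apple": "looking", "is": "looking", "looking": "looking",
    "at": "looking", "buying": "at", "U.K.": "startup",
    "startup": "buying", "for": "buying", "$": "billion",
    "1": "billion", "billion": "for",
}

def ancestors(word):
    """Follow head links upward until reaching the root (whose head is itself)."""
    chain = []
    while heads[word] != word:
        word = heads[word]
        chain.append(word)
    return chain

print(ancestors("U.K."))   # ['startup', 'buying', 'at', 'looking']
print(ancestors("looking"))  # [] because it is the ROOT
```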


In [10]:
print(spacy.explain("nsubj"))
print(spacy.explain("aux"))
print(spacy.explain("prep"))
print(spacy.explain("pcomp"))

nominal subject
auxiliary
prepositional modifier
complement of preposition


<font color='blue'>
    
# Let's look at the next part of the slides: 
* Semantics: 20-21
* Pragmatics: 24-25
* Challenges: 26-30
* POS: 47-49
* NER: 50-51


### Named Entities

In [11]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
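
The numbers printed above are character offsets into the original text (`start_char` inclusive, `end_char` exclusive), not token indices. A quick plain-Python check, using the offsets copied from the output above:

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the output above
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# Slicing the raw string with the offsets recovers each entity text
recovered = [(text[start:end], label) for start, end, label in entities]
print(recovered)
```

This is useful when entity annotations have to be aligned with the raw text, for example for highlighting.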


In [12]:
spacy.explain("GPE")

'Countries, cities, states'

### Some internals: word vectors and out-of-vocabulary (OOV) tokens

In [14]:
tokens = nlp("dog cat banana afskfsd")

for token in tokens:
    print("\n", token.text, token.has_vector, token.vector_norm, token.is_oov)
    print("Length of vector:", len(token.vector), token.vector[:10]) # show only first part of vector


 dog True 7.0336733 False
Length of vector: 300 [-0.40176   0.37057   0.021281 -0.34125   0.049538  0.2944   -0.17376
 -0.27982   0.067622  2.1693  ]

 cat True 6.6808186 False
Length of vector: 300 [-0.15067  -0.024468 -0.23368  -0.23378  -0.18382   0.32711  -0.22084
 -0.28777   0.12759   1.1656  ]

 banana True 6.700014 False
Length of vector: 300 [ 0.20228  -0.076618  0.37032   0.032845 -0.41957   0.072069 -0.37476
  0.05746  -0.012401  0.52949 ]

 afskfsd False 0.0 True
Length of vector: 300 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


<font color="red">

### Exercise 2: using basic named entities and dependencies
* Take the text of a random Wikipedia article (copy and paste)
* Split the article into sentences with NLTK
* Print all persons named in the article
* Print all locations named in the article
* Print the person names that are subjects (nsubj) of their sentences according to the dependency parser.
* Find out what the sentences that mention both a location and a person are about: what are the
    ROOTs (according to the dependency parser) of those sentences?


In [15]:
## download a book from the web -- take the first 200 lines
import urllib.request  # the request submodule must be imported explicitly to use urlopen
url = "https://raw.githubusercontent.com/NSkelsey/cvf/master/war_and_peace.txt"
data = urllib.request.urlopen(url)  # file-like object; iterating over it yields lines as bytes
# 'utf-8-sig' drops the byte-order mark (BOM) at the start of the file
text = [line.decode('utf-8-sig') for line in data]
text = "".join(text[:200])  # keep only the first 200 lines of the book
print(text[:1000])

The Project Gutenberg EBook of War and Peace, by Leo Tolstoy

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: War and Peace

Author: Leo Tolstoy

Posting Date: January 10, 2009 [EBook #2600]
Release Date: April, 2001
[Last updated: August 22, 2012]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK WAR AND PEACE ***




An Anonymous Volunteer





WAR AND PEACE

By Leo Tolstoy/Tolstoi





BOOK ONE: 1805





CHAPTER I


"Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by that
Antichrist--I really believe he is Antichrist--I will have nothing more
to do with you an

In [16]:
doc = nlp(text)

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)




The Project Gutenberg EBook ORG
War and Peace WORK_OF_ART
Leo Tolstoy PERSON
the Project ORG
eBook ORG
War and Peace

 WORK_OF_ART
Leo Tolstoy PERSON
January 10, 2009 DATE
2600 MONEY
April, 2001 DATE
August 22, 2012 DATE
English LANGUAGE
Leo Tolstoy PERSON
Tolstoi





 PERSON
ONE CARDINAL
1805





 DATE
Genoa GPE
Lucca GPE
the
Buonapartes PERSON
Antichrist PERSON
July, 1805 DATE
Anna Pavlovna
Scherer PERSON
Marya Fedorovna PERSON
Prince Vasili Kuragin PERSON
first ORDINAL
Anna Pavlovna
 PERSON
some days DATE
la
 ORG
St. Petersburg GPE
French LANGUAGE
an evening TIME
tonight TIME
7 CARDINAL
10 CARDINAL
Heavens WORK_OF_ART
French NORP
Anna Pavlovna PERSON
First ORDINAL
one CARDINAL
one CARDINAL
Anna Pavlovna PERSON
English NORP
Wednesday DATE
today DATE
Novosiltsev ORG
Buonaparte PERSON
Anna Pavlovna Scherer PERSON
forty years DATE
Anna Pavlovna PERSON
Austria GPE
Austria GPE
Russia GPE
Europe LOC
one CARDINAL
earth LOC
England GPE
Alexander PERSON
Malta GPE
English 

## Visualize entities and dependencies!

In [19]:
# visualize entities

from spacy import displacy

doc_ent = nlp(u'When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously.')

displacy.render([doc_ent, docX], style='ent', jupyter=True)


In [20]:
# visualize dependencies

from spacy import displacy


doc_ent = nlp(u'When Sebastian Thrun started working on self-driving cars at Google '
u'in 2007, few people outside of the company took him seriously.')
displacy.render([doc_ent, docX], style='dep', jupyter=True)

## Token, Span and Doc similarity

Doc and Span vectors default to the average of their token vectors, and `similarity` is the cosine similarity between such vectors.
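
As a rough sketch of that default (averaging token vectors, then comparing with cosine similarity), here is the computation spelled out in NumPy. The three-dimensional `toy_vectors` are made-up numbers for illustration only, not real embeddings, and `doc_vector` and `cosine` are hypothetical helper names.

```python
import numpy as np

# Made-up 3-dimensional "word vectors" (illustration only)
toy_vectors = {
    "good": np.array([1.0, 0.0, 1.0]),
    "bad": np.array([-1.0, 0.0, 1.0]),
    "pizza": np.array([0.0, 1.0, 1.0]),
}

def doc_vector(words):
    """Average the vectors of all words, like the default Doc/Span vector."""
    return np.mean([toy_vectors[w] for w in words], axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(doc_vector(["good", "pizza"]), doc_vector(["bad", "pizza"]))
print(round(sim, 4))  # 0.6667
```

Even though "good" and "bad" point in opposite directions on one axis, the averaged vectors stay fairly similar because the two texts share a word. The same effect helps explain why short texts with opposite meaning can still receive a high similarity score.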


In [46]:
# Determine semantic similarities
doc1 = nlp(u'the fries were gross')
doc2 = nlp(u'worst fries ever')
doc1.similarity(doc2)


0.7791662980402656

In [47]:
# similarity between 2 words
doc1a = nlp(u'large')
doc2a = nlp(u'small')
doc1a.similarity(doc2a)

0.8343904257366469

In [None]:
# similarity on the basis of tokens

doc = nlp(u"Apple and banana are similar. Pasta and hippo aren't.")
apple = doc[0]
banana = doc[2]
pasta = doc[6]
hippo = doc[8]

assert apple.similarity(banana) > pasta.similarity(hippo)
assert apple.has_vector

print(apple.vector)


In [49]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]

print(doc.similarity(token))

0.32531983166759537


In [50]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.6199092090831612


## spaCy with the Russian language

### Taken from: https://github.com/buriy/spacy-ru


In [90]:
!pip3 install pymorphy2==0.8

Collecting pymorphy2==0.8
  Downloading https://files.pythonhosted.org/packages/a3/33/fff9675c68b5f6c63ec8c6e6ff57827dda28a1fa5b2c2d727dffff92dd47/pymorphy2-0.8-py2.py3-none-any.whl (46kB)
Collecting pymorphy2-dicts<3.0,>=2.4 (from pymorphy2==0.8)
  Downloading https://files.pythonhosted.org/packages/02/51/2465fd4f72328ab50877b54777764d928da8cb15b74e2680fc1bd8cb3173/pymorphy2_dicts-2.4.393442.3710985-py2.py3-none-any.whl (7.1MB)
Collecting docopt>=0.6 (from pymorphy2==0.8)
  Downloading https://files.pythonhosted.org/packages/a2/55/8f8cab2afd404cf578136ef2cc5dfb50baa1761b68c9da1fb1e4eed343c9/docopt-0.6.2.tar.gz
Collecting dawg-python>=0.7 (from pymorphy2==0.8)
  Downloading https://files.pythonhosted.org/packages/6a/84/ff1ce2071d4c650ec85745766c0047ccc3b503

In [None]:
# 1.) go to the folder where the ipynb is located
!mkdir ru2
!git clone -b v2.1 https://github.com/buriy/spacy-ru.git
!cp -r ./spacy-ru/ru2/. ru2/


In [94]:
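import spacy
sample_sentences = "Привет Миру! Как твои дела? Сегодня неплохая погода."
if __name__ == '__main__':
    nlp = spacy.load('ru2')  # load the Russian model from the local ru2 folder
    # spaCy v2 API: put a rule-based sentence splitter at the start of the pipeline
    nlp.add_pipe(nlp.create_pipe('sentencizer'), first=True)
    doc = nlp(sample_sentences)
    for s in doc.sents:
        print(list(['lemma "{}" from text "{}"'.format(t.lemma_, t.text) for t in s]))
        # t.pos is the integer ID of the coarse POS tag (t.pos_ gives the string)
        print(list(['pos "{}" from text "{}"'.format(t.pos, t.text) for t in s]))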

['lemma "привет" from text "Привет"', 'lemma "мир" from text "Миру"', 'lemma "!" from text "!"']
['pos "92" from text "Привет"', 'pos "96" from text "Миру"', 'pos "97" from text "!"']
['lemma "как" from text "Как"', 'lemma "твой" from text "твои"', 'lemma "дело" from text "дела"', 'lemma "?" from text "?"']
['pos "86" from text "Как"', 'pos "90" from text "твои"', 'pos "92" from text "дела"', 'pos "97" from text "?"']
['lemma "сегодня" from text "Сегодня"', 'lemma "неплохой" from text "неплохая"', 'lemma "погода" from text "погода"', 'lemma "." from text "."']
['pos "86" from text "Сегодня"', 'pos "84" from text "неплохая"', 'pos "92" from text "погода"', 'pos "97" from text "."']


In [98]:
doc = nlp(sample_sentences)
for s in doc.sents:
    print(list(['pos "{}" from text "{}"'.format(t.pos_, t.text) for t in s]))
    # second line per sentence: dependency labels (previously mislabelled as "pos")
    print(list(['dep "{}" from text "{}"'.format(t.dep_, t.text) for t in s]))

['pos "NOUN" from text "Привет"', 'pos "PROPN" from text "Миру"', 'pos "PUNCT" from text "!"']
['dep "ROOT" from text "Привет"', 'dep "appos" from text "Миру"', 'dep "punct" from text "!"']
['pos "ADV" from text "Как"', 'pos "DET" from text "твои"', 'pos "NOUN" from text "дела"', 'pos "PUNCT" from text "?"']
['dep "mark" from text "Как"', 'dep "det" from text "твои"', 'dep "ROOT" from text "дела"', 'dep "punct" from text "?"']
['pos "ADV" from text "Сегодня"', 'pos "ADJ" from text "неплохая"', 'pos "NOUN" from text "погода"', 'pos "PUNCT" from text "."']
['dep "advmod" from text "Сегодня"', 'dep "amod" from text "неплохая"', 'dep "ROOT" from text "погода"', 'dep "punct" from text "."']


In [93]:
doc.ents


(Миру,)

In [100]:
doc[1:4].text

'Миру! Как'

## spaCy internals

In [29]:
nlp = spacy.load("en_core_web_md")
doc = nlp("I love coffee.")

In [32]:
print(doc.vocab)
print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])

<spacy.vocab.Vocab object at 0x7f2a4e44eef0>
hash value: 3197928453018144401
string value: coffee
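
The hash-to-string mapping shown above lives in the `StringStore`. As a small sketch, it can also be tried out in isolation, without loading a model. The hash value below is the one from the output above; `store`, `coffee_hash` and `coffee_text` are illustrative names.

```python
from spacy.strings import StringStore

# A fresh store, independent of any loaded pipeline
store = StringStore(["coffee"])

coffee_hash = store["coffee"]      # string -> 64-bit hash
coffee_text = store[coffee_hash]   # hash -> string (works because "coffee" was added)
print(coffee_hash, coffee_text)
```

The hash is computed from the string alone, so it is the same in every vocab. The reverse lookup only works in a store that has actually seen the string.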


In [37]:
# lexeme is a vocab entry
lexeme = nlp.vocab['coffee']
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


In [43]:
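# spans are slices of the doc object
from spacy.tokens import Span

span = Span(doc, 1, 3)  # tokens 1 and 2 (the end index is exclusive)
print(span.text)

# a span can also carry a label (as entity spans do); the text stays the same
span = Span(doc, 1, 3, label="part 2")
print(span.text)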

love coffee
love coffee
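
A `Doc` can also be built by hand from a list of words, which makes the relation between `Doc`, `Span` and labels easier to inspect. This is a minimal sketch assuming spaCy 2.1 or later (string labels on `Span`); the label name `DRINK_PHRASE` and the variable names are made up for illustration.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

words = ["I", "love", "coffee", "."]
spaces = [True, True, False, False]  # whether each token is followed by a space
manual_doc = Doc(Vocab(), words=words, spaces=spaces)

# An explicitly constructed, labeled span (hypothetical label name)
labeled = Span(manual_doc, 1, 3, label="DRINK_PHRASE")
# Slicing the doc gives an unlabeled span over the same tokens
same_tokens = manual_doc[1:3]

print(labeled.text, labeled.label_, repr(same_tokens.label_))
```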


## UNIT 5: spaCy rule-based matching

Here we can define simple patterns that are matched not only against the text, but also against the annotations (e.g. the POS tags).

This can be used for information extraction tasks.

See: https://course.spacy.io/chapter1, item 10

![alt text](screen1.png "Title")
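
Before turning to patterns over model annotations, here is a minimal sketch of a pattern that uses only lexical attributes, so it runs on a blank pipeline without any trained model. It assumes the list-of-patterns form of `Matcher.add` (spaCy 2.2.2 or later, including v3). The pattern name, the example sentence and the `demo_` variables are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")  # tokenizer only; enough for lexical attributes
demo_matcher = Matcher(nlp_blank.vocab)

# "version", optionally followed by punctuation, followed by something number-like
demo_pattern = [
    {"LOWER": "version"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LIKE_NUM": True},
]
demo_matcher.add("VERSION_PATTERN", [demo_pattern])

demo_doc = nlp_blank("Version 3 is out; version: 2 is old.")
# Each match is (match_id, start, end); sort by position for a stable order
found = sorted(set((start, end) for _, start, end in demo_matcher(demo_doc)))
texts = [demo_doc[start:end].text for start, end in found]
print(texts)
```

Attributes such as `POS` or `LEMMA` can be used in the same way, but they only match when the pipeline actually produces those annotations.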


In [53]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

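# Load a model and create the nlp object
# (the small model needs its own download: python3 -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
# (spaCy v2 signature; in spaCy v3 the call is matcher.add('IPHONE_PATTERN', [pattern]))
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)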

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

print(matches)

[(9528407286733565721, 1, 3)]


In [54]:
print(matches)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

[(9528407286733565721, 1, 3)]
iPhone X


In [55]:
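# a pattern spanning 5 tokens
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
doc = nlp("2018 FIFA World Cup: France won!")

# run the pattern (same spaCy v2 add signature as in the first Matcher cell)
matcher = Matcher(nlp.vocab)
matcher.add('FIFA_PATTERN', None, pattern)
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)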


In [56]:
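# multiple conditions for one token
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
doc = nlp("I loved dogs but now I love cats more.")

# run the pattern (same spaCy v2 add signature as in the first Matcher cell)
matcher = Matcher(nlp.vocab)
matcher.add('LOVE_PATTERN', None, pattern)
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)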


In [57]:
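# the second token of the pattern is optional
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
doc = nlp("I bought a smartphone. Now I'm buying apps.")

# run the pattern (same spaCy v2 add signature as in the first Matcher cell)
matcher = Matcher(nlp.vocab)
matcher.add('BUY_PATTERN', None, pattern)
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)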