## Intro to spaCy

When working with JupyterLab on Google's Deep Learning VM, it turns out to be necessary to install Python packages directly from JupyterLab cells, using the kernel's own interpreter, and not via pip on the command line:

**To install spaCy (or any other package):**

    import sys
    !{sys.executable} -m pip install spacy scipy
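
This works because `sys.executable` points at the interpreter the notebook kernel is running, so the package lands in the kernel's environment rather than in whichever Python a shell `pip` happens to belong to. A minimal sketch for checking, from a cell, whether a package is importable by the current kernel; the helper name `is_installed` is made up for illustration:

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if the running kernel's interpreter can import the package."""
    return importlib.util.find_spec(package_name) is not None


# The interpreter that `!{sys.executable} -m pip install ...` installs into
print(sys.executable)

for name in ("spacy", "scipy"):
    print(name, "installed:", is_installed(name))
```

If a package shows up as missing here although `pip` reported success in a terminal, the terminal was most likely using a different interpreter.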

**To get embeddings into spaCy:**

    !python3 -m spacy download en_core_web_lg

### Getting Started with spaCy (English)

- Removing Stopwords 
- Lemmatizing

In [45]:
#Imports 
import spacy 
from spacy.lang.en import English
nlp = English()

#Doc
doc = nlp('''In what manner has Republican backing of "states rights" been hypocritical
and what ways have they actually restricted the ability of states to make their own laws?''')

In [46]:
doc[2:5]

manner has Republican

#### Stopwords

In [47]:
from spacy.lang.en import STOP_WORDS

stopwords = list(STOP_WORDS)
print(len(stopwords))
stopwords[:5]

305


['however', 'are', 'never', 'quite', 'neither']

In [48]:
for word in doc: 
    if word.is_stop:
        print(word.text, end=" | ")

what | has | of | been | and | what | have | they | the | of | to | make | their | own | 

#### Lemmas

In [49]:
for word in doc: 
    print(word.text, '--> Lemma:', word.lemma_, end=' | ')

In --> Lemma: In | what --> Lemma: what | manner --> Lemma: manner | has --> Lemma: have | Republican --> Lemma: Republican | backing --> Lemma: back | of --> Lemma: of | " --> Lemma: " | states --> Lemma: state | rights --> Lemma: right | " --> Lemma: " | been --> Lemma: be | hypocritical --> Lemma: hypocritical | 
 --> Lemma: 
 | and --> Lemma: and | what --> Lemma: what | ways --> Lemma: way | have --> Lemma: have | they --> Lemma: they | actually --> Lemma: actually | restricted --> Lemma: restrict | the --> Lemma: the | ability --> Lemma: ability | of --> Lemma: of | states --> Lemma: state | to --> Lemma: to | make --> Lemma: make | their --> Lemma: their | own --> Lemma: own | laws --> Lemma: law | ? --> Lemma: ? | 

#### Punctuation 


In [50]:
import string

punctuations = string.punctuation
punctuations

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
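
The punctuation string is typically used together with the stopword list to clean a token sequence. A minimal sketch of that filtering step in plain Python, assuming the text is already tokenized; the tiny `stop_words` set and the `clean_tokens` helper are illustrative stand-ins, not spaCy's actual list or API:

```python
import string

# Illustrative stand-in for spaCy's STOP_WORDS (the real list is much longer)
stop_words = {"in", "what", "has", "of", "been", "and", "the", "to"}


def clean_tokens(tokens):
    """Drop stopwords (case-insensitive) and pure-punctuation tokens."""
    kept = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue
        # A token counts as punctuation if every character is a punctuation mark
        if all(ch in string.punctuation for ch in tok):
            continue
        kept.append(tok)
    return kept


tokens = ["In", "what", "manner", "has", "Republican", "backing", "of",
          '"', "states", "rights", '"', "been", "hypocritical", "?"]
print(clean_tokens(tokens))
# ['manner', 'Republican', 'backing', 'states', 'rights', 'hypocritical']
```

Note that this sketch lowercases before the stopword check, so the sentence-initial "In" is dropped as well.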

#### Linguistic Annotations

Install the model first via the command line: ***python -m spacy download en_core_web_md***

Refer to: https://spacy.io/models/en#section-en_core_web_md

In [51]:
nlp = spacy.load('en_core_web_md')
doc = nlp(u'New York: Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text, token.pos_, token.dep_)

New PROPN compound
York PROPN npadvmod
: PUNCT punct
Apple PROPN nsubj
is VERB aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


In [52]:
spacy.explain('PROPN')

'proper noun'

In [53]:
spacy.explain('npadvmod')

'noun phrase as adverbial modifier'

#### Visualizing Dependencies

In [54]:
#Option 1 
from spacy import displacy

displacy.render(doc, style='dep', jupyter=True, options={'distance': 120, 'offset_x': 25})

In [55]:
displacy.render(doc, style='ent', jupyter=True)

In [56]:
spacy.explain('GPE')

'Countries, cities, states'

### German? 

In [57]:
#Imports 
import spacy 
from spacy.lang.de import German

nlp_de = German()

In [58]:
#Stopwords 
from spacy.lang.de import STOP_WORDS

stopwords_de = list(STOP_WORDS)
print(len(stopwords_de))
stopwords_de[:8]

543


['gekannt',
 'mir',
 'währenddessen',
 'diejenige',
 'oben',
 'tat',
 'zehnten',
 'offen']

In [59]:
doc = nlp_de("Inwiefern konnte die deutsche Bundesregierung eine wirtschaftliche Trendwende seit 2008 einleiten?")

#stopwords 
for word in doc: 
    if word.is_stop:
        print(word.text, end=" | ")
print('\n -------------------------\n')
      
#lemmata 
for word in doc: 
    print(word.text, '--> Lemma:', word.lemma_, end=' | ')
print('\n -------------------------\n')
    
#annotations (n/a if corresponding model not loaded)
for token in doc:
    print(token.text, token.pos_, token.dep_)

konnte | die | eine | seit | 
 -------------------------

Inwiefern --> Lemma: Inwiefern | konnte --> Lemma: können | die --> Lemma: der | deutsche --> Lemma: deutsch | Bundesregierung --> Lemma: Bundesregierung | eine --> Lemma: einen | wirtschaftliche --> Lemma: wirtschaftlich | Trendwende --> Lemma: Trendwende | seit --> Lemma: seit | 2008 --> Lemma: 2008 | einleiten --> Lemma: einleiten | ? --> Lemma: ? | 
 -------------------------

Inwiefern  
konnte  
die  
deutsche  
Bundesregierung  
eine  
wirtschaftliche  
Trendwende  
seit  
2008  
einleiten  
?  


### Embeddings with spaCy

Refer to https://spacy.io/models/en#section-en_core_web_sm.

In [60]:
from spacy.tokens import Token

#nlp = spacy.load('en_core_web_md')
doc = nlp(u'New York: Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

New True 5.265908 False
York True 6.940326 False
: True 5.474056 False
Apple True 7.1346846 False
is True 4.890306 False
looking True 5.4164834 False
at True 6.0998254 False
buying True 6.2184978 False
U.K. True 6.626984 False
startup True 6.779131 False
for True 4.8435082 False
$ True 7.748268 False
1 True 5.269974 False
billion True 8.310136 False


In [61]:
doc = nlp(u'Demystifying crottles: a lichen used in Scotland to make a brownish dye for wool.')

for token in doc:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

Demystifying True 6.245355 False
crottles False 0.0 True
: True 5.474056 False
a True 5.306696 False
lichen True 6.86926 False
used True 5.209864 False
in True 5.0929856 False
Scotland True 7.1264977 False
to True 4.74484 False
make True 5.0838113 False
a True 5.306696 False
brownish True 6.80549 False
dye True 7.085415 False
for True 4.8435082 False
wool True 7.6126904 False
. True 4.9316354 False
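
The `crottles` row shows what an out-of-vocabulary token looks like: no vector, hence a norm of 0.0. A rough sketch of that behaviour with a toy vector table; the two-dimensional vectors and the `lookup` helper are invented for illustration and do not mirror spaCy's internals:

```python
import math

# Toy vector table; the model used above stores far longer vectors
toy_vectors = {
    "lichen": [0.6, 0.8],
    "wool": [3.0, 4.0],
}


def lookup(word, dim=2):
    """Return (has_vector, vector); unknown words fall back to a zero vector."""
    if word in toy_vectors:
        return True, toy_vectors[word]
    return False, [0.0] * dim


def norm(vector):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


for word in ["lichen", "crottles", "wool"]:
    has_vector, vector = lookup(word)
    print(word, has_vector, norm(vector))
```

Because a zero vector has no direction, cosine similarity against it is undefined, so it is worth filtering on `has_vector` before comparing tokens.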


#### Tryout 1a: Getting Synonyms via word vectors

Slow implementation, see: https://github.com/explosion/spaCy/issues/276

In [62]:
nlp.vocab[u'dog']

<spacy.lexeme.Lexeme at 0x7f95426ff870>

In [63]:
def most_similar(word):
    queries = [w for w in word.vocab if w.is_lower == word.is_lower and w.prob >= -15]
    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    return by_similarity[:10]

[w.lower_ for w in most_similar(nlp.vocab[u'dog'])]

['dog',
 'kennel',
 'canine',
 'hound',
 'canines',
 'dogs',
 'puppy',
 'poodle',
 'terrier',
 'husky']
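
The slowness comes from calling `similarity` once per lexeme in Python and then sorting the whole vocabulary. One common way around this, sketched here on a made-up vector table rather than on spaCy's vocabulary, is to stack the vectors into a matrix, normalize the rows, and get all cosine similarities with a single matrix-vector product. The words, vectors, and the `most_similar_vectorized` name are assumptions for illustration:

```python
import numpy as np

# Made-up 3-dimensional vectors; a real model would supply these
words = ["dog", "puppy", "hound", "cat", "car"]
vectors = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [0.9, 0.5, 0.1],
    [0.1, 0.9, 0.3],
    [0.1, 0.1, 0.9],
], dtype=np.float32)


def most_similar_vectorized(query, words, vectors, top=3):
    """Return the `top` (word, score) pairs with the highest cosine similarity to `query`."""
    # Normalize rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    query_unit = query / max(float(np.linalg.norm(query)), 1e-12)
    scores = unit @ query_unit
    # argsort is ascending, so reverse it and keep the best `top` indices
    best = np.argsort(scores)[::-1][:top]
    return [(words[i], float(scores[i])) for i in best]


print(most_similar_vectorized(vectors[0], words, vectors))
```

As in the output above, the query word itself comes back first with a similarity of about 1.0; drop it from the result if only neighbours are wanted.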

#### Tryout 1b: Getting Synonyms via Brown Clusters

In [None]:
#see link

#### Tryout 2: Grasping Analogies 

In [64]:
doc = nlp(u'Paris France Rome Italy')
similarities = {}
for token1 in doc:
    for token2 in doc:
        similarities[str(token1) +'-'+ str(token2)] = token1.similarity(token2)

In [65]:
similarities

{'France-France': 1.0,
 'France-Italy': 0.7207298,
 'France-Paris': 0.7916327,
 'France-Rome': 0.5515125,
 'Italy-France': 0.7207298,
 'Italy-Italy': 1.0,
 'Italy-Paris': 0.58605415,
 'Italy-Rome': 0.72202295,
 'Paris-France': 0.7916327,
 'Paris-Italy': 0.58605415,
 'Paris-Paris': 1.0,
 'Paris-Rome': 0.58241165,
 'Rome-France': 0.5515125,
 'Rome-Italy': 0.72202295,
 'Rome-Paris': 0.58241165,
 'Rome-Rome': 1.0}

In [66]:
italy_fake = doc[1].vector - doc[0].vector + doc[2].vector
italy_real = doc[3].vector

print(len(italy_real))
italy_real[:5]

300


array([-0.21052  ,  0.18476  , -0.0056243, -0.15168  ,  0.78708  ],
      dtype=float32)

In [67]:
import numpy    

#calculate cosine similarity
numpy.dot(italy_fake, italy_real) / (numpy.linalg.norm(italy_fake) * numpy.linalg.norm(italy_real))

0.74263257

In [68]:
from scipy import spatial 
import numpy as np 

x = np.array([3,1])
y = np.array([1,3])

d = 1 - spatial.distance.cosine(x, y)
d

0.6
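
The value 0.6 can be checked by hand from the definition of cosine similarity, which is the same quantity the numpy expression in the cell before computes:

$$
\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
= \frac{3 \cdot 1 + 1 \cdot 3}{\sqrt{3^2 + 1^2}\,\sqrt{1^2 + 3^2}}
= \frac{6}{\sqrt{10}\,\sqrt{10}} = 0.6
$$

scipy's `spatial.distance.cosine` returns the cosine *distance*, i.e. one minus this value, which is why the cell subtracts the result from 1.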

In [29]:
#load large model
from scipy import spatial
nlp = spacy.load('en_core_web_lg')

class AnalogyFinder: 
    
    def __init__(self, positive_1, positive_2, negative, top=20):
        
        self.positive_1 = nlp.vocab[positive_1].vector 
        self.positive_2 = nlp.vocab[positive_2].vector 
        self.negative = nlp.vocab[negative].vector 
        self.top = top 
        self.analogies = self.compute_analogy()

    def compute_analogy(self):

        # Target vector: positive_1 - negative + positive_2 (e.g. woman - man + king)
        calculated_output_word = self.positive_1 - self.negative + self.positive_2
        cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)
        computed_similarities = []
 
        for word in nlp.vocab:

            # Ignore words without vectors
            if not word.has_vector:
                continue
            
            #go for lowercase
            if word.is_lower:
                similarity = cosine_similarity(calculated_output_word, word.vector)
                computed_similarities.append((word, similarity))

        computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
        
        self.analogies = [w[0].text for w in computed_similarities[:self.top]]
        
        return self.analogies

**Let's play a little... the famous king example...**

In [30]:
maybe_queen = AnalogyFinder("man","king","woman")
maybe_queen.analogies

['king',
 'kings',
 'lord',
 'prince',
 'kingdom',
 'throne',
 'reign',
 'mighty',
 'princes',
 'emperor',
 'god',
 'duke',
 'thee',
 'thou',
 'lords',
 'royal',
 'hath',
 'man',
 'queen',
 'beast']

In [37]:
maybe_queen_2 = AnalogyFinder("woman","king","man")
maybe_queen_2.analogies

['king',
 'queen',
 'prince',
 'kings',
 'princess',
 'royal',
 'throne',
 'queens',
 'monarch',
 'kingdom',
 'empress',
 'lady',
 'woman',
 'princes',
 'mother',
 'duke',
 'emperor',
 'reign',
 'goddess',
 'lord']

**How about the aunts?**

In [31]:
maybe_aunt = AnalogyFinder("man","uncle","woman")
maybe_aunt.analogies

['uncle',
 'brother',
 'nephew',
 'grandfather',
 'father',
 'brother-in-law',
 'dad',
 'brothers',
 'grandpa',
 'uncles',
 'father-in-law',
 'bro',
 'grandson',
 'cousin',
 'buddy',
 'sons',
 'granddad',
 'son',
 'grandad',
 'man']

In [38]:
maybe_aunt_2 = AnalogyFinder("woman","uncle","man")
maybe_aunt_2.analogies

['aunt',
 'grandmother',
 'mother',
 'uncle',
 'sister',
 'wife',
 'daughter',
 'niece',
 'husband',
 'sister-in-law',
 'mother-in-law',
 'cousin',
 'father',
 'mom',
 'grandfather',
 'grandma',
 'granddaughter',
 'dad',
 'stepfather',
 'daughters']

**Et voilà, we get there... :)**

In [43]:
maybe_walked = AnalogyFinder("swam","walk","swim")
maybe_walked.analogies

['walked',
 'walk',
 'strolled',
 'wandered',
 'walking',
 'jogged',
 'swam',
 'drove',
 'sprinted',
 'stood',
 'waited',
 'biked',
 'beside',
 'stroll',
 'walks',
 'trekked',
 'chased',
 'trudged',
 'rode',
 'ambled']

In [44]:
maybe_paris = AnalogyFinder("Rome","France","Italy")
maybe_paris.analogies

['rome',
 'france',
 'paris',
 'lyon',
 'europe',
 'toulouse',
 'french',
 'roman',
 'marseille',
 'prague',
 'romans',
 'versailles',
 'cannes',
 'amsterdam',
 'francais',
 'berlin',
 'montreal',
 'quebec',
 'eiffel',
 'maison']

**Pretty neat... :) However, which word is passed as the negative one turns out to be important: it has to be the counterpart of the second positive word, i.e. 'man' rather than 'woman' for 'uncle'/'king', 'swim' rather than 'swam' for 'walk', 'Italy' rather than 'Rome' for 'France'!**
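
A toy example makes the reason visible. `compute_analogy` evaluates `positive_1 - negative + positive_2`, so swapping the first positive word and the negative word flips the direction of the offset. A small sketch with hand-made two-dimensional vectors (axes loosely meaning "royalty" and "gender"); the vectors and helper names are invented for illustration and not taken from any model:

```python
import math

# Hand-made vectors: (royalty, gender)
toy = {
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
}


def cosine(a, b):
    """Cosine similarity of two equal-length 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(positive_1, positive_2, negative):
    """Rank all toy words by similarity to positive_1 - negative + positive_2."""
    target = tuple(
        p1 - n + p2
        for p1, n, p2 in zip(toy[positive_1], toy[negative], toy[positive_2])
    )
    return sorted(toy, key=lambda w: cosine(target, toy[w]), reverse=True)


print(analogy("woman", "king", "man"))  # woman - man + king
# ['queen', 'woman', 'king', 'man']
print(analogy("man", "king", "woman"))  # man - woman + king
# ['man', 'king', 'queen', 'woman']
```

With 'man' as the negative word the target lands on 'queen'; with 'woman' as the negative word it drifts towards 'man' and 'king', which mirrors the real-model results above.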