# spaCy

spaCy is a Python library that allows us to perform many NLP tasks, such as:

- Tokenization
- PoS tagging
- Entity detection
- Dependency parsing
- Semantic similarity

 
Another important advantage of spaCy is that it can be used to obtain word vectors (word embeddings) for every word in a sentence. spaCy already includes some pre-trained models with these vectors, which you can use right away.
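
To give an intuition of how word vectors support semantic similarity, here is a minimal sketch of cosine similarity, the measure that spaCy's similarity methods are documented to use by default. The three-dimensional vectors and the helper `cosine_similarity` are made up for illustration. They are not real spaCy vectors, which have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, for illustration only
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

sim_dog_cat = cosine_similarity(toy_vectors["dog"], toy_vectors["cat"])
sim_dog_car = cosine_similarity(toy_vectors["dog"], toy_vectors["car"])
print("dog vs cat:", round(sim_dog_cat, 3))
print("dog vs car:", round(sim_dog_car, 3))
```

Words whose vectors point in a similar direction get a score close to 1, while unrelated words get a score close to 0.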
 

 https://nlpforhackers.io/complete-guide-to-spacy/
 
  
## Install spacy
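First, make sure spaCy is available in your Google Colab environment. The cell below uses pip to install the English model package **en_core_web_sm** (version 2.0.0) from its GitHub release.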



In [48]:
  
!pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz


Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
     |████████████████████████████████| 37.4MB 113kB/s 
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.0.0-cp36-none-any.whl size=37405977 sha256=79e38cdba87edf1e2275902e18198f39dddf0981fdeb80b77439b316d375e4e1
  Stored in directory: /root/.cache/pip/wheels/54/7c/d8/f86364af8fbba7258e14adae115f18dd2c91552406edc3fdaa
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
  Found existing installation: en-core-web-sm 2.1.0
    Uninstalling en-core-web-sm-2.1.0:
      Successfully uninstalled en-core-web-sm-2.1.0
Successfully installed en-core-web-sm-2.0.0


Once spaCy is installed, you need to download at least one of the models that spaCy provides.

You can find the list of available models at https://spacy.io/usage/models

The cells below load the small English model **en_core_web_sm**, which includes the vocabulary, PoS tags, dependency parse and named entities. For word vectors (used to obtain semantic similarity), the larger model **en_core_web_md** is the one to use.

For English, you first download the model **en_core_web_sm** and then load it by that name:

In [0]:
!python3 -m spacy download en_core_web_sm

import spacy
nlp = spacy.load('en_core_web_sm')           # load model package "en_core_web_sm"
print('spacy.en loaded')

Collecting en_core_web_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1MB)
     |████████████████████████████████| 11.1MB 4.0MB/s 
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.1.0-cp36-none-any.whl size=11074435 sha256=dc447bb5de8dc9b63c7f3893783715e5d210d38a749cc886ea031b0d507cc596
  Stored in directory: /tmp/pip-ephem-wheel-cache-5pq63xn9/wheels/39/ea/3b/507f7df78be8631a7a3d7090962194cf55bc1158572c0be77f
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
  Found existing installation: en-core-web-sm 2.0.0
    Uninstalling en-core-web-sm-2.0.0:
      Successfully uninstalled en-core-web-sm-2.0.0
Successfully installed en-core-web-sm-2.1.0
✔ Download and installation successful
You can now load the model


## Mounting your drive folder

Now, we will learn how to load text documents. To load the data, we first need to mount your Drive folder and give the notebook access to it. This requires a one-step authentication: when you run the cell below, please follow the instructions.




In [0]:
from google.colab import drive
drive.mount("/content/drive/")
!ls

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
drive  sample_data


 Now, we can load a text file:


In [0]:
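sst_home = 'drive/My Drive/Colab Notebooks/'
# Replace this folder with the name of your own folder in Google Drive,
# where you save the notebooks of this course
sst_home += 'CURSONLP/1-basicNLP/'

filename = sst_home + 'example1.txt'

# Read the whole file and close it afterwards.
# UTF-8 is assumed here because the text contains accented characters.
with open(filename, encoding='utf-8') as f:
    text = f.read()

print(text)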

Pedro Sánchez Pérez-Castejón (born 29 February 1972) is a Spanish politician serving as Prime Minister of Spain since 2 June 2018. On 7 January 2020, Pedro Sanchez was confirmed by the Congress of Deputies as Prime Minister with a lead of just two votes (167 to 165), at the helm of the first coalition government since the restoration of democracy in the 1970s, ending the political deadlock that included two inconclusive elections.He has also been Secretary-General of the Spanish Socialist Workers' Party (PSOE) since June 2017, having previously held that office from 2014 to 2016.
He served as a Madrid city councillor from 2004 to 2009, before being elected to the Congress of Deputies. In 2014, he became Secretary-General of the PSOE, becoming the party's candidate for Prime Minister in the 2015 and 2016 general elections. Sánchez resigned as Secretary-General after disagreements with the party's executive, and was re-elected the following year during a series of primaries, defeating Su


## Sentence splitting

spaCy also splits the text into sentences. The sentences are available through the property **sents** of the document, which yields them one by one. For each sentence, you can traverse its tokens and read their properties.

The following cell prints each sentence together with its index. If you uncomment the inner loop, it also shows every token of the sentence and its PoS tag:


In [0]:
document = nlp(text)

for i,s in enumerate(document.sents):
    print(i,s)
    #for token in s:
    #  print('\t',token.orth_, token.pos_)


0 Pedro Sánchez Pérez-Castejón (born 29 February 1972) is a Spanish politician serving as Prime Minister of Spain since 2 June 2018.
1 On 7 January 2020, Pedro Sanchez was confirmed by the Congress of Deputies as Prime Minister with a lead of just two votes (167 to 165), at the helm of the first coalition government since the restoration of democracy in the 1970s, ending the political deadlock that included two inconclusive elections.
2 He has also been Secretary-General of the Spanish Socialist Workers' Party (PSOE) since June 2017, having previously held that office from 2014 to 2016.

3 He served as a Madrid city councillor from 2004 to 2009, before being elected to the Congress of Deputies.
4 In 2014, he became Secretary-General of the PSOE, becoming the party's candidate for Prime Minister in the 2015 and 2016 general elections.
5 Sánchez resigned as Secretary-General after disagreements with the party's executive, and was re-elected the following year during a series of primaries
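
Note that in the output above spaCy separates sentences 1 and 2 even though the source text has no space after the full stop ("elections.He"). For contrast, here is a sketch of a naive rule-based splitter applied to an abridged excerpt of the same text. The regular expression is only a baseline chosen for this comparison, and it is not how spaCy segments sentences.

```python
import re

# Abridged excerpt of the example text (note the missing space in "elections.He")
excerpt = (
    "The deadlock included two inconclusive elections.He has also been "
    "Secretary-General of the PSOE since June 2017.\n"
    "He served as a Madrid city councillor from 2004 to 2009."
)

# Naive rule: split wherever '.', '!' or '?' is followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", excerpt.strip())

for i, s in enumerate(naive_sentences):
    print(i, s)
```

The naive rule keeps "elections.He" inside a single sentence, because it only splits where whitespace follows the punctuation mark.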

## Tokenization 

Once you have loaded a document, you can parse it with spaCy. You only have to pass the text as an argument to the object **nlp**, which was created above. This returns a document object that you can iterate over to obtain the tokens of the text and their properties. These properties are very useful for representing instances in NLP tasks such as Named Entity Recognition or Text Classification.

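In the next cell, we iterate over the tokens and show some of their properties:
- original text
- shape (a pattern of its uppercase letters, lowercase letters and digits)
- PoS tag
- lemma of the token
- prefix (in the output below, the first character)
- suffix (in the output below, the last three characters)
- Brown cluster id: the cluster to which the token belongs (it is 0 for every token in the output below, which suggests that the loaded model carries no cluster information)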

You can find the description of the token properties at the following link: https://spacy.io/api/token#attributes
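
To make the shape, prefix and suffix properties more concrete, the following sketch recomputes them in plain Python for some tokens of the example text. The function `word_shape` is a hypothetical approximation written for this tutorial: the rule that runs of the same shape character are cut after four is inferred from outputs such as "Sánchez" giving "Xxxxx". In practice you simply read `token.shape_`, `token.prefix_` and `token.suffix_`.

```python
def word_shape(text):
    """Approximate a token shape: X = uppercase, x = lowercase, d = digit."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            shape_char = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            shape_char = "d"
        else:
            shape_char = ch  # punctuation is kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= 4:  # keep at most four repetitions of the same shape character
            shape.append(shape_char)
    return "".join(shape)

for word in ["Pedro", "Sánchez", "-", "born", "29"]:
    # prefix = first character, suffix = last three characters
    print(word, word_shape(word), word[:1], word[-3:])
```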

In [0]:

for i, token in enumerate(document):
    print("original:", token.orth_)
    print("shape:", token.shape_)
    print("PoS tag:", token.pos_)
    #print("lowercased:", token.lower_)
    print("lemma:", token.lemma_)
    print("prefix:", token.prefix_)
    print("suffix:", token.suffix_)
    print("Brown cluster id:", token.cluster)
    print("----------------------------------------")
    # stop after the first 12 tokens (i = 0..11)
    if i > 10:
        break

original: Pedro
shape: Xxxxx
PoS tag: PROPN
lemma: Pedro
prefix: P
suffix: dro
Brown cluster id: 0
----------------------------------------
original: Sánchez
shape: Xxxxx
PoS tag: PROPN
lemma: Sánchez
prefix: S
suffix: hez
Brown cluster id: 0
----------------------------------------
original: Pérez
shape: Xxxxx
PoS tag: PROPN
lemma: Pérez
prefix: P
suffix: rez
Brown cluster id: 0
----------------------------------------
original: -
shape: -
PoS tag: PUNCT
lemma: -
prefix: -
suffix: -
Brown cluster id: 0
----------------------------------------
original: Castejón
shape: Xxxxx
PoS tag: PROPN
lemma: Castejón
prefix: C
suffix: jón
Brown cluster id: 0
----------------------------------------
original: (
shape: (
PoS tag: PUNCT
lemma: (
prefix: (
suffix: (
Brown cluster id: 0
----------------------------------------
original: born
shape: xxxx
PoS tag: VERB
lemma: bear
prefix: b
suffix: orn
Brown cluster id: 0
----------------------------------------
original: 29
shape: dd
PoS tag: NUM
lemma:

## Entity Detection
spaCy is also able to identify the named entities that occur in the text. The named entities of a document are stored in its property **ents**, and each sentence exposes the entities it contains through the same property.
For each entity, you can access the following properties:
- string: contains the whole mention of the named entity, including any trailing whitespace (in spaCy 2.x; the property **text** gives the mention without it).
- label_: is the entity type.
- start_char and end_char: are the character offsets of the mention in the text.
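
The offsets can be checked with plain string slicing. The sketch below uses the first sentence of the example text and the (label, start_char, end_char) values that spaCy reports for it in the output of this section. No spaCy call is involved, and the variable names are chosen for illustration.

```python
# First sentence of the example text. It starts at character 0 of the document,
# so document offsets and sentence offsets coincide here.
sentence = (
    "Pedro Sánchez Pérez-Castejón (born 29 February 1972) is a Spanish "
    "politician serving as Prime Minister of Spain since 2 June 2018."
)

# (label, start_char, end_char) triples copied from the tutorial output
entities = [
    ("PERSON", 0, 13),
    ("DATE", 38, 51),
    ("NORP", 58, 65),
    ("GPE", 106, 111),
    ("DATE", 118, 129),
]

# end_char is exclusive, exactly like the end of a Python slice
mentions = [(label, sentence[start:end]) for label, start, end in entities]

for label, mention in mentions:
    print(label, repr(mention))
```

Because `end_char` is exclusive, `text[start_char:end_char]` recovers the mention without any trailing whitespace.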






In [0]:
for i,s in enumerate(document.sents):
  print("Text sentence: ", i, s)
  print('Named entities for the sentence: ', i)
  for e in s.ents:
    print('\t',e.string,e.label_,e.start_char, e.end_char)
  print()  # blank line between sentences (a bare `print` does nothing in Python 3)

#for w in document.ents:
#  print(w.string,w.label_)
  
  


Text sentence:  0 Pedro Sánchez Pérez-Castejón (born 29 February 1972) is a Spanish politician serving as Prime Minister of Spain since 2 June 2018.
Named entities for the sentence:  0
	 Pedro Sánchez  PERSON 0 13
	 February 1972 DATE 38 51
	 Spanish  NORP 58 65
	 Spain  GPE 106 111
	 2 June 2018 DATE 118 129
Text sentence:  1 On 7 January 2020, Pedro Sanchez was confirmed by the Congress of Deputies as Prime Minister with a lead of just two votes (167 to 165), at the helm of the first coalition government since the restoration of democracy in the 1970s, ending the political deadlock that included two inconclusive elections.
Named entities for the sentence:  1
	 7 January 2020 DATE 134 148
	 Pedro Sanchez  PERSON 150 163
	 the Congress of Deputies  ORG 181 205
	 two  CARDINAL 244 247
	 167  CARDINAL 255 258
	 165 CARDINAL 262 265
	 first  ORDINAL 287 292
	 the 1970s DATE 352 361
	 two  CARDINAL 407 410
Text sentence:  2 He has also been Secretary-General of the Spanish Socialist Worker

## Noun chunker

You can obtain the noun phrases of a text directly through the property **noun_chunks**.

For each noun phrase, you can read the following properties:
- text: the whole text of the noun phrase
- root: the root token of the noun phrase
- dep: the grammatical relationship of the noun phrase in the sentence (read from its root token as root.dep_). You can find more information about these relationships (dependencies) at https://nlp.stanford.edu/software/dependencies_manual.pdf. Some of these dependencies are:
 - *nsubj*: the syntactic subject. For example, 'Clinton defeated Dole' -> nsubj(defeated, Clinton).
 - *dobj*: the direct object of a VP. For example, 'She gave me a raise' -> dobj(gave, raise).
 - *pobj*: the object of a preposition. For example, 'I sit on the chair' -> pobj(on, chair).
- head: the source of the grammatical relationship (read as root.head). For example, for the relationship nsubj between 'shift' and 'Autonomous cars', the head is 'shift'.
 

 
 

 
 




In [0]:
import spacy
text= "The boy with the spotted dog quickly ran after the firetruck."
#text= "I saw the keys on the table."
doc = nlp(text)

for chunk in doc.noun_chunks:
    print('text chunk:',chunk.text)
    print('root chunk:',chunk.root.text)
    print('grammatical dependency:',chunk.root.dep_)
    print('head chunk:',chunk.root.head.text)
    print('---------------------------------')
    

text chunk: The boy
root chunk: boy
grammatical dependency: nsubj
head chunk: ran
---------------------------------
text chunk: the spotted dog
root chunk: dog
grammatical dependency: pobj
head chunk: with
---------------------------------
text chunk: the firetruck
root chunk: firetruck
grammatical dependency: pobj
head chunk: after
---------------------------------
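
As a small follow-up, the sketch below shows one way these properties could be used downstream: picking out the subject and rebuilding the prepositional phrases. The records are copied by hand from the output above, so no spaCy call is involved. With a real document you would read the same values from `chunk.text`, `chunk.root.dep_` and `chunk.root.head.text`.

```python
# (text, root, dep, head) records copied from the noun_chunks output above
chunks = [
    ("The boy", "boy", "nsubj", "ran"),
    ("the spotted dog", "dog", "pobj", "with"),
    ("the firetruck", "firetruck", "pobj", "after"),
]

# Subjects together with the verb they depend on
subjects = [(text, head) for text, root, dep, head in chunks if dep == "nsubj"]

# Prepositional phrases: the head of a pobj chunk is its preposition
prep_phrases = [f"{head} {text}" for text, root, dep, head in chunks if dep == "pobj"]

print(subjects)
print(prep_phrases)
```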





## Dependency parsing

spaCy provides a fast and accurate syntactic dependency parser. This parser obtains the grammatical relations between the tokens of a sentence. This information is crucial for NLP tasks such as relation extraction.
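
A dependency parse can be seen as a tree in which every token points to its head. The sketch below stores the arcs that the parser prints for the first nine tokens of the example sentence of this section, then walks the tree in plain Python. The head indices and the helpers `children` and `subtree` are assumptions added for illustration, since the printed output only shows the head words.

```python
# (word, dependency label, index of the head token) for the first nine tokens of
# "The boy with the spotted dog quickly ran after the firetruck."
arcs = [
    ("The", "det", 1),
    ("boy", "nsubj", 7),
    ("with", "prep", 1),
    ("the", "det", 5),
    ("spotted", "amod", 5),
    ("dog", "pobj", 2),
    ("quickly", "advmod", 7),
    ("ran", "ROOT", 7),   # the root is its own head
    ("after", "prep", 7),
]

def children(index):
    """Indices of the tokens whose head is the token at `index`."""
    return [i for i, (_, _, head) in enumerate(arcs) if head == index and i != index]

def subtree(index):
    """Indices of a token and all its descendants, in sentence order."""
    found = [index]
    for child in children(index):
        found.extend(subtree(child))
    return sorted(found)

root = next(i for i, (_, _, head) in enumerate(arcs) if head == i)
subject_phrase = " ".join(arcs[i][0] for i in subtree(1))

print("root:", arcs[root][0])
print("children of root:", [arcs[i][0] for i in children(root)])
print("subtree of 'boy':", subject_phrase)
```

Collecting the subtree of a token is a common way to recover whole phrases, such as the full subject of the verb, from individual arcs.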

You can try parsing any sentence with the Stanford CoreNLP online tool: http://nlp.stanford.edu:8080/corenlp/process


The following cell shows the dependency parse of a sentence. For each token, you can determine its grammatical relation (**dep_**) and the word it is connected to (**head**):


In [0]:
# Let's look at the dependencies of this example:
example = "The boy with the spotted dog quickly ran after the firetruck."
#example="I saw the keys on the table."
parsedEx = nlp(example)
# shown as: original token, dependency tag, head word
for token in parsedEx:
    print("word:",token.orth_)
    print("grammatical relation:", token.dep_)
    print("connected word (head):", token.head.orth_)
    print('------------------------------------------')

word: The
grammatical relation: det
connected word (head): boy
------------------------------------------
word: boy
grammatical relation: nsubj
connected word (head): ran
------------------------------------------
word: with
grammatical relation: prep
connected word (head): boy
------------------------------------------
word: the
grammatical relation: det
connected word (head): dog
------------------------------------------
word: spotted
grammatical relation: amod
connected word (head): dog
------------------------------------------
word: dog
grammatical relation: pobj
connected word (head): with
------------------------------------------
word: quickly
grammatical relation: advmod
connected word (head): ran
------------------------------------------
word: ran
grammatical relation: ROOT
connected word (head): ran
------------------------------------------
word: after
grammatical relation: prep
connected word (head): ran
------------------------------------------
word: the
grammatical re