<h1>**Exploring SpaCy**</h1>

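spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It provides tokenization, part-of-speech tagging, dependency parsing, named entity recognition and word vectors, each of which is explored below.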

In [7]:
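# Import spaCy and load the English pipeline.
# Note: the 'en' shortcut works in spaCy 2.x only; spaCy 3.x requires a full
# package name such as 'en_core_web_sm'.
import spacy

nlp = spacy.load('en')
doc = nlp('Trying to understand what Spacy does and exploring its different features')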

<h2>**Tokenization**</h2>

In [10]:
for token in doc:
    print(token.text)

Trying
to
understand
what
Spacy
does
and
exploring
its
different
features


First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

1. **Does the substring match a tokenizer exception rule?** For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't".
2. **Can a prefix, suffix or infix be split off?** For example, punctuation such as commas, periods, hyphens or quotes.

<h2>**Part-of-speech tags and dependencies**</h2>

In [19]:
doc = nlp('Trying to understand what Spacy does and exploring its different features')

# Print the linguistic attributes of each token, separated by blank lines.
for token in doc:
    print()
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)


Trying try VERB VBG ROOT Xxxxx True False

to to PART TO aux xx True True

understand understand VERB VB xcomp xxxx True False

what what NOUN WP dobj xxxx True True

Spacy spacy PROPN NNP nsubj Xxxxx True False

does do VERB VBZ ccomp xxxx True True

and and CCONJ CC cc xxx True True

exploring explore VERB VBG conj xxxx True False

its -PRON- ADJ PRP$ poss xxx True True

different different ADJ JJ amod xxxx True False

features feature NOUN NNS dobj xxxx True False


**[column 1] Text: The original word text.**

**[column 2] Lemma: The base form of the word.**

**[column 3] POS: The simple part-of-speech tag.**

**[column 4] Tag: The detailed part-of-speech tag.**

**[column 5] Dep: Syntactic dependency, i.e. the relation between tokens.**

**[column 6] Shape: The word shape – capitalisation, punctuation, digits.**

**[column 7] is alpha: Does the token consist only of alphabetic characters?**

**[column 8] is stop: Is the token part of a stop list, i.e. the most common words of the language?**
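
To make the Shape column more concrete, here is a minimal sketch of how such a shape string could be derived. `toy_shape` is a hypothetical helper, not spaCy's implementation; the rule that a run of identical shape characters is cut off after four is an assumption inferred from the output above (for example, "Trying" gives `Xxxxx` and "understand" gives `xxxx`).

```python
def toy_shape(text):
    """Map letters to X/x and digits to d, truncating runs after 4 characters."""
    shape = []
    last = ""
    run = 0  # how many times the current shape character has repeated
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            # Punctuation and other characters are kept as they are.
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Trying", "understand", "its", "2019"]:
    print(word, toy_shape(word))
```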

<h2>**Named Entities**</h2>

In [21]:
doc = nlp('Trying to understand what Spacy does and exploring its different features')

# Print each named entity with its character offsets and label.
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Spacy 26 31 GPE


**[column 1] Text: The original entity text.**

**[column 2] Start: Character offset of the start of the entity in the Doc's text.**

**[column 3] End: Character offset of the end of the entity in the Doc's text (exclusive).**
    
**[column 4] Label: Entity label, i.e. type.**
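
The start and end values are offsets into the raw text, so they can be checked with ordinary string slicing. A small sketch using the values 26 and 31 from the output above (no spaCy needed):

```python
text = 'Trying to understand what Spacy does and exploring its different features'

# Offsets reported for the entity in the output above.
start_char, end_char = 26, 31

# Slicing the raw text with these offsets recovers the entity text.
print(text[start_char:end_char])
```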

<h2>**Word vectors and similarity**</h2>

In [28]:
tokens = nlp('bus car man')

# Compare every token with every other token (including itself).
for token1 in tokens:
    for token2 in tokens:
        print(token1.similarity(token2), " \t")

1.0  	
0.659165  	
0.458621  	
0.659165  	
1.0  	
0.663724  	
0.458621  	
0.663724  	
1.0  	
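
By default, spaCy computes similarity as the cosine similarity between two vectors, which is why each token scores 1.0 against itself and the scores are symmetric. The sketch below reproduces that calculation in plain Python. The three-dimensional `toy_vectors` are made-up values for illustration only, not the vectors spaCy uses, so the printed numbers will not match the output above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors; real word vectors have hundreds of dimensions.
toy_vectors = {
    "bus": [0.9, 0.8, 0.1],
    "car": [0.8, 0.9, 0.3],
    "man": [0.1, 0.3, 0.9],
}

# Print one row of the similarity matrix per word.
for word, vector in toy_vectors.items():
    row = [round(cosine_similarity(vector, other), 3) for other in toy_vectors.values()]
    print(word, row)
```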
