### Word Vectors

In this notebook, we will explore some of spaCy's word vectors trained on different datasets.

Interesting visualizations and presentations of word vectors:
- Sense2Vec Demo: https://explosion.ai/demos/sense2vec
- GloVe Vectors Visualizer: https://lamyiowce.github.io/word2viz/

If you want to incorporate word vectors into your final project, one idea is to use them to measure document similarity.

You can see an example of this here: https://github.com/v1shwa/document-similarity

Like the techniques we used in clustering, Word2Vec is essentially a dimensionality reduction solution: each word is mapped to a point in a relatively low-dimensional space. This means you can take the mean of the word vectors in a document and treat it as the "cluster space" of that document. Finding other documents whose mean vectors lie nearby in that space could then work as a similarity search.

I would recommend using the spaCy Reddit vectors if you choose this approach.
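The mean-vector idea can be sketched without any model at all. The snippet below is a minimal illustration, not the method used by the linked repository: the three-dimensional `TOY_VECTORS` table and the helper names `mean_vector` and `cosine` are made up for this example, and real spaCy vectors have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.0],
    "pet": [0.7, 0.3, 0.1],
    "banana": [0.0, 0.1, 0.9],
    "fruit": [0.1, 0.2, 0.8],
}


def mean_vector(words, vectors=TOY_VECTORS):
    """Average the vectors of the known words; unknown words are skipped."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("document contains no known words")
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Each "document" is just a list of words here.
doc_pets = mean_vector(["dog", "cat", "pet"])
doc_more_pets = mean_vector(["cat", "pet"])
doc_fruit = mean_vector(["banana", "fruit"])

sim_pets = cosine(doc_pets, doc_more_pets)
sim_mixed = cosine(doc_pets, doc_fruit)
print("pets vs. pets: ", round(sim_pets, 3))
print("pets vs. fruit:", round(sim_mixed, 3))
```

With real vectors you would not need these helpers: spaCy's `doc.vector` defaults to the average of the token vectors, and `doc.similarity()` compares two documents on that basis.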

For more on spaCy word vectors and similarity, see: https://spacy.io/usage/vectors-similarity (Most of the examples below come from this documentation!)


### Installing word vectors 

In [None]:
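# This will take a few minutes but please only run it once!
!python -m spacy download en_core_web_lg
# 'en' is the spaCy v2 shortcut for the small English model (en_core_web_sm).
# spaCy v3+ removed shortcuts, so use the full name en_core_web_sm there instead.
!python -m spacy download en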

### Loading vectors

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_lg')

### Similarity based on word vectors

In [None]:
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1, token2, token1.similarity(token2))
    print()

### Out of vocabulary words

In [None]:
tokens = nlp(u'dog ape banana Katharine foooo')

for token in tokens:
    print(token.text, token.has_vector, 
        token.vector_norm, token.is_oov)
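To make the columns printed above easier to read, here is a small sketch of how a vector table could produce them. It assumes that a word missing from the table falls back to an all-zero vector, which is why its norm comes out as 0. The `TOY_VECTORS` table and the `lookup` helper are hypothetical and do not reflect spaCy's internals.

```python
import math

# Hypothetical vector table, invented for illustration; not spaCy's real data.
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "banana": [0.0, 0.1, 0.9],
}
DIMS = 3


def lookup(word, vectors=TOY_VECTORS):
    """Return (vector, has_vector, vector_norm) for a word."""
    has_vector = word in vectors
    # Assumption: unknown words get an all-zero vector of the same width.
    vector = vectors[word] if has_vector else [0.0] * DIMS
    norm = math.sqrt(sum(x * x for x in vector))
    return vector, has_vector, norm


for word in ["dog", "banana", "foooo"]:
    vector, has_vector, norm = lookup(word)
    print(word, has_vector, round(norm, 3))
```

A zero norm is also why similarity scores involving such words are not meaningful: there is no direction to compare.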

### Text Parsing

In [None]:
nlp.pipeline

In [None]:
my_doc = nlp("I like sushi.")

In [None]:
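# In a live notebook, type "my_doc." and press Tab to browse its attributes.
# The line below lists the public ones without needing tab completion.
[name for name in dir(my_doc) if not name.startswith('_')]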

In [None]:
[(t, len(t.vector)) for t in my_doc]

In [None]:
my_doc[-2].vector

### Vector similarity for out of vocabulary words

In [None]:
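# The small model ships without real word vectors, so similarity here relies
# on context-sensitive tensors instead of static vectors.
# ("en" was the spaCy v2 shortcut for this model; v3+ needs the full name.)
nlp = spacy.load("en_core_web_sm")

doc1 = nlp(u"Trudy meows to be let in.")
doc2 = nlp(u"Trudy hunted a mouse.")
doc3 = nlp(u"My friend Trudy is an accountant.")
doc4 = nlp(u"We all know Trudy can be very obstinate.")

# Parse "cat" once, outside the loop, rather than on every iteration.
cat = nlp(u"cat")

for doc in [doc1, doc2, doc3, doc4]:
    # Pick out the "Trudy" token from each sentence.
    trudy = [w for w in doc if 'Trudy' in w.text][0]
    print(doc, cat.similarity(trudy))
    print()

# Check whether the last "Trudy" token is out of vocabulary.
trudy.is_oov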