## Document Vectors
Doc2vec allows us to directly learn representations for texts of arbitrary length (phrases, sentences, paragraphs, and documents) by taking the context of the words in the text into account.<br><br>
In this notebook, we will create a document vector by averaging word vectors with spaCy. [spaCy](https://spacy.io/) is a Python library for Natural Language Processing (NLP) with many built-in capabilities and features. spaCy provides several types of models. The default model for the English language is '**en_core_web_sm**'.
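
To make the averaging idea concrete before bringing in spaCy, here is a minimal pure-Python sketch. The `toy_vectors` dictionary and the `average_vector` helper are hypothetical names introduced only for illustration, and the 3-dimensional values are made up; they do not come from any spaCy model.

```python
# Toy 3-dimensional word vectors. The values are invented for illustration
# and are NOT taken from a spaCy model.
toy_vectors = {
    "dog":   [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man":   [2.0, 3.0, 0.0],
}


def average_vector(tokens, vectors):
    """Return the element-wise mean of the vectors of all known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


doc_a = average_vector("dog bites man".split(), toy_vectors)
doc_b = average_vector("man bites dog".split(), toy_vectors)

print(doc_a)           # [1.0, 2.0, 1.0]
print(doc_a == doc_b)  # True
```

With fixed per-word vectors like these, averaging ignores word order, so "dog bites man" and "man bites dog" get the same document vector. The vectors printed later in this notebook may not behave this way: as far as we know, the small spaCy 2.x models ship context-sensitive tensors rather than static word vectors, so the same word can receive different values in different sentences.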

In [1]:
# To install only the requirements of this notebook, run this cell

# ===========================

!pip install spacy==2.2.4

# ===========================



In [2]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try :
#     import google.colab
#     !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch3/ch3-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError :
#     !pip install -r "ch3-requirements.txt"

# ===========================

In [3]:
# downloading en_core_web_sm, assuming spacy is already installed
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py): started
  Building wheel for en-core-web-sm (setup.py): finished with status 'done'
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.2.5-py3-none-any.whl size=12011738 sha256=569c5173b5d308181f71764b1747fa5c0792b8e61b9d2fb23b154c63a1fe8ef5
  Stored in directory: C:\Users\KUMARA~1\AppData\Local\Temp\pip-ephem-wheel-cache-ajhpjg8a\wheels\b5\94\56\596daa677d7e91038cbddfcf32b591d0c915a1b3a3e3d3c79d
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.2.5
[+] Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [4]:
# Import spaCy and load the model
import spacy
nlp = spacy.load("en_core_web_sm")  # nlp refers to the loaded 'en_core_web_sm' language model instance

In [5]:
# Assume each sentence in `documents` corresponds to a separate document.
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
# Lowercase each document and strip the periods.
processed_docs = [doc.lower().replace(".", "") for doc in documents]

print("Document After Pre-Processing:", processed_docs)


# Run each document through the spaCy pipeline.
for doc in processed_docs:
    doc_nlp = nlp(doc)  # a spaCy "Doc" object: a container for accessing linguistic annotations

    print("-" * 30)
    # doc_nlp.vector is the average of the document's token vectors
    print("Average Vector of '{}'\n".format(doc), doc_nlp.vector)
    for token in doc_nlp:
        print()
        # the text of each token in the doc and its vector
        print(token.text, token.vector)

Document After Pre-Processing: ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
------------------------------
Average Vector of 'dog bites man'
 [ 1.581809    0.01234585 -1.7686375   0.09192207  0.7099541   0.78623253
 -0.22294267  0.6251769   3.3184946   2.151605    1.2572327   0.80755144
  2.5496542   1.0309687  -1.1097575  -1.3333966   0.17883821  0.0732042
 -1.652985   -2.0776246  -1.1226162  -1.1873754  -0.32874927 -0.9920702
  0.53296596 -0.8248777  -0.1621434  -1.3841887   1.7683312  -0.60252315
  2.2155359  -0.05637012 -0.42379442 -0.8782012  -1.7660996  -0.8552634
  2.000716   -0.6098452  -3.9005392   0.8609192   4.002558    2.4361043
  0.33687058 -1.50351     0.17922412 -0.37065843 -0.84056693 -1.1102012
 -0.7800233   0.47405422 -1.460539   -1.7216535  -0.5347146  -0.94166833
 -1.2782294  -0.00511471  2.1441052  -1.0279812  -2.0211596  -0.2488722
 -0.06398487  1.988259    0.01575983 -0.17211278  1.9156709   1.4806844
  0.6663373  -1.2645482  -2.3974214   