# Vector Representation

Transform text into feature vectors.

In [1]:
doc1 = 'Summer is coming but Summer is short'
doc2 = 'I like the Summer and I like the Winter'
doc3 = 'I like sandwiches and I like the Winter'
documents = [doc1, doc2, doc3]

##  Count Vector

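The main class used will be CountVectorizer. By default it produces count vectors: each component is the number of times a vocabulary term occurs in a document (pass binary=True to get binary vectors instead).
The vocabulary is learned from the sample documents defined above.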

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = 'word', max_features = 5000)

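Among other things, the vectorizer preprocesses the input: by default it lowercases the text and ignores single-character tokens, which is why 'I' does not appear in the vocabulary.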

In [4]:
vectors = vectorizer.fit_transform(documents)
print(vectors.toarray())
print(vectorizer.get_feature_names())  # scikit-learn >= 1.0: use get_feature_names_out(); get_feature_names() was removed in 1.2

[[0 1 1 2 0 0 1 2 0 0]
 [1 0 0 0 2 0 0 1 2 1]
 [1 0 0 0 2 1 0 0 1 1]]
['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short', 'summer', 'the', 'winter']
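To make the output above less of a black box, here is a minimal pure-Python sketch of what a count vectorizer does. It assumes the default behaviour described in the scikit-learn documentation (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). The helper names `tokenize` and `count_vector` are illustrative and not part of any library.

```python
import re
from collections import Counter

documents = [
    'Summer is coming but Summer is short',
    'I like the Summer and I like the Winter',
    'I like sandwiches and I like the Winter',
]

# Assumption: tokens are runs of two or more word characters,
# mirroring CountVectorizer's default token pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})


def count_vector(text):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


count_vectors = [count_vector(doc) for doc in documents]
print(vocabulary)
for row in count_vectors:
    print(row)
```

The printed rows should match the matrix produced by `fit_transform` above, one row per document and one column per vocabulary term.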


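The token pattern controls which tokens are kept. The default pattern requires at least two word characters, so single-letter words are dropped; the pattern below also accepts single-character tokens: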

In [5]:
vectorizer = CountVectorizer(analyzer='word', stop_words=None, token_pattern='(?u)\\b\\w+\\b')
vectors=vectorizer.fit_transform(documents)
vectorizer.get_feature_names()

['and',
 'but',
 'coming',
 'i',
 'is',
 'like',
 'sandwiches',
 'short',
 'summer',
 'the',
 'winter']
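The effect of the two patterns can be checked directly with Python's `re` module. This is only a sketch: it applies the regular expressions to a lowercased sample sentence by hand rather than going through the vectorizer, and the variable names are illustrative.

```python
import re

text = 'I like the Summer'.lower()

# Default pattern: at least two word characters, so 'i' is dropped.
default_tokens = re.findall(r"(?u)\b\w\w+\b", text)

# Pattern used above: one or more word characters, so 'i' is kept.
single_char_tokens = re.findall(r"(?u)\b\w+\b", text)

print(default_tokens)      # ['like', 'the', 'summer']
print(single_char_tokens)  # ['i', 'like', 'the', 'summer']
```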

Now the token 'i' (the lowercased 'I') is part of the vocabulary. In practice, however, it is usually more useful to filter out stop words:

In [7]:
vectorizer = CountVectorizer(analyzer='word', stop_words='english', token_pattern='(?u)\\b\\w+\\b')
vectors=vectorizer.fit_transform(documents)
vectorizer.get_feature_names()

['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']

The sparse vectors can be converted into dense arrays of numbers, and the distance between them can then be calculated.

In [9]:
f_array = vectors.toarray()
f_array

array([[1, 0, 0, 1, 2, 0],
       [0, 2, 0, 0, 1, 1],
       [0, 2, 1, 0, 0, 1]], dtype=int64)

In [11]:
from scipy.spatial.distance import cosine
d12 = cosine(f_array[0], f_array[1])
d13 = cosine(f_array[0], f_array[2])
d23 = cosine(f_array[1], f_array[2])
print(d12, d13, d23)

0.666666666667 1.0 0.166666666667
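The cosine distance is one minus the cosine of the angle between two vectors, that is, 1 - (u · v) / (|u| |v|). The sketch below recomputes the three distances by hand from the count array shown earlier; `cosine_distance` is an illustrative helper, not the SciPy implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Count vectors copied from the array printed above.
f_rows = [
    [1, 0, 0, 1, 2, 0],
    [0, 2, 0, 0, 1, 1],
    [0, 2, 1, 0, 0, 1],
]

d12 = cosine_distance(f_rows[0], f_rows[1])
d13 = cosine_distance(f_rows[0], f_rows[2])
d23 = cosine_distance(f_rows[1], f_rows[2])
print(d12, d13, d23)
```

Documents 1 and 3 share no terms once stop words are removed, so their distance is 1.0; documents 2 and 3 share 'like' and 'winter', so they are the closest pair.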


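## Tf-idf Vector Representation

The TfidfVectorizer class is used instead of CountVectorizer. It weights each term count by the term's inverse document frequency, so terms that appear in many documents contribute less to the vector.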

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer=TfidfVectorizer(analyzer = 'word', stop_words='english')
vectors=vectorizer.fit_transform(documents)
vectorizer.get_feature_names()

['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']

In [13]:
vectors.toarray()  # dense tf-idf matrix: one L2-normalized row per document, one column per vocabulary term

array([[ 0.48148213,  0.        ,  0.        ,  0.48148213,  0.73235914,
         0.        ],
       [ 0.        ,  0.81649658,  0.        ,  0.        ,  0.40824829,
         0.40824829],
       [ 0.        ,  0.77100584,  0.50689001,  0.        ,  0.        ,
         0.38550292]])
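The numbers in this matrix can be reproduced by hand. The sketch below assumes the defaults documented for TfidfVectorizer: raw term counts, a smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. The counts are copied from the CountVectorizer output earlier; the helper `tfidf_row` is illustrative.

```python
import math

vocabulary = ['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']
counts = [
    [1, 0, 0, 1, 2, 0],
    [0, 2, 0, 0, 1, 1],
    [0, 2, 1, 0, 0, 1],
]
n_docs = len(counts)

# Document frequency: in how many documents each term occurs.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed idf (assumed default: smooth_idf=True).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """Weight counts by idf, then scale the row to unit length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]


tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(x, 8) for x in row])
```

In the first document 'summer' occurs twice, but it also appears in a second document, so its idf is lower than that of 'coming' and 'short'. Its weight therefore ends up less than twice theirs.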

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

query = ['winter short']

# We transform the query into a vector of the learnt vocabulary
vector_query = vectorizer.transform(query)

# Here we calculate the cosine similarity between the query and each document (higher means more similar)
cosine_similarity(vector_query, vectors)


array([[ 0.38324078,  0.24713249,  0.23336362]])

In [15]:
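from sklearn.metrics.pairwise import linear_kernel
# Tf-idf rows are L2-normalized by default, so the dot product (linear kernel) equals the cosine similarity.
# A new name is used so the imported cosine_similarity function is not shadowed.
similarities = linear_kernel(vector_query, vectors).flatten()
similarities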

array([ 0.38324078,  0.24713249,  0.23336362])
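The two results agree because the tf-idf rows have unit length, so dividing the dot product by the norms changes nothing. The sketch below checks this on two rows copied (rounded) from the tf-idf matrix printed earlier; the helper names `dot` and `norm` are illustrative.

```python
import math

# Rows for the second and third documents, copied from the tf-idf output.
row2 = [0.0, 0.81649658, 0.0, 0.0, 0.40824829, 0.40824829]
row3 = [0.0, 0.77100584, 0.50689001, 0.0, 0.0, 0.38550292]


def dot(u, v):
    """Plain dot product, which is what a linear kernel computes."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(u, u))


linear = dot(row2, row3)
cosine_sim = linear / (norm(row2) * norm(row3))

print(norm(row2), norm(row3))  # both approximately 1
print(linear, cosine_sim)      # approximately equal
```

For vectors that are not normalized, for example raw count vectors, the linear kernel and the cosine similarity would generally differ.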