# Vector representation

Examples:

In [1]:
doc1 = 'Summer is coming but Summer is short'
doc2 = 'I like the Summer and I like the Winter'
doc3 = 'I like sandwiches and I like the Winter'
documents = [doc1, doc2, doc3]

The focus is on the scikit-learn library. Pandas will also be used.

## Count vector

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word', max_features=5000)
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [6]:
vectors = vectorizer.fit_transform(documents)
print(vectors.toarray())

[[0 1 1 2 0 0 1 2 0 0]
 [1 0 0 0 2 0 0 1 2 1]
 [1 0 0 0 2 1 0 0 1 1]]


In [7]:
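# get_feature_names() was removed in scikit-learn 1.2;
# on recent versions use get_feature_names_out() instead.
print(vectorizer.get_feature_names())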

['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short', 'summer', 'the', 'winter']
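
To make the matrix above less opaque, here is a minimal pure-Python sketch of how such count vectors can be built. It is not scikit-learn's implementation; the helper names `tokenize` and `count_vector` are made up for illustration, and it only assumes the lowercasing and default token pattern shown in the vectorizer's parameters.

```python
import re
from collections import Counter

documents = [
    'Summer is coming but Summer is short',
    'I like the Summer and I like the Winter',
    'I like sandwiches and I like the Winter',
]

# Same regex as the default token_pattern: tokens of two or more word characters.
TOKEN_RE = re.compile(r'(?u)\b\w\w+\b')

def tokenize(text):
    # Lowercase first, mirroring lowercase=True in the parameters above.
    return TOKEN_RE.findall(text.lower())

# The vocabulary is the sorted set of all tokens; each term is one column.
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})

def count_vector(text):
    # Count every token, then read the counts off in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

count_matrix = [count_vector(doc) for doc in documents]
print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row of `count_matrix` corresponds to one document and each column to one vocabulary term, which is the same layout as the array printed by `vectors.toarray()`.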


In [9]:
vectorizer = CountVectorizer(analyzer='word', stop_words=None, token_pattern='(?u)\\b\\w+\\b')
vectors = vectorizer.fit_transform(documents)
vectorizer.get_feature_names()

['and',
 'but',
 'coming',
 'i',
 'is',
 'like',
 'sandwiches',
 'short',
 'summer',
 'the',
 'winter']

The output above shows that 'i' is now included. The default 'token_pattern' only matches tokens of two or more characters, so use the pattern above whenever single-character words matter.
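
The difference between the two patterns can be checked directly with Python's `re` module. This is only a sketch of the regex behaviour on one of the example sentences; the variable names are illustrative, and the text is lowercased by hand to mimic the vectorizer's `lowercase=True` setting.

```python
import re

text = 'I like the Summer and I like the Winter'.lower()

# Default pattern: a token needs at least two word characters.
default_tokens = re.findall(r'(?u)\b\w\w+\b', text)

# Relaxed pattern: one word character is enough, so 'i' survives.
relaxed_tokens = re.findall(r'(?u)\b\w+\b', text)

print(default_tokens)
print(relaxed_tokens)
```
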
The next step is to filter out the stop words ('and', 'but', 'i', 'is', 'the').

In [10]:
vectorizer = CountVectorizer(analyzer='word', stop_words='english', token_pattern='(?u)\\b\\w+\\b')
vectors = vectorizer.fit_transform(documents)
vectorizer.get_feature_names()

['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']
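
Conceptually, stop-word filtering just drops tokens that appear in a fixed list before the vocabulary is built. The sketch below illustrates this in plain Python. `STOP_WORDS` is a hypothetical mini list containing only the five words that occur in these documents, not scikit-learn's full English list, and `tokenize` is an illustrative helper.

```python
import re

documents = [
    'Summer is coming but Summer is short',
    'I like the Summer and I like the Winter',
    'I like sandwiches and I like the Winter',
]

# Hypothetical mini stop list (assumption): only the stop words present in the examples.
STOP_WORDS = {'and', 'but', 'i', 'is', 'the'}

def tokenize(text):
    # Lowercase and keep single-character tokens, as in the pattern used above.
    return re.findall(r'(?u)\b\w+\b', text.lower())

# Build the vocabulary from the tokens that are not stop words.
filtered_vocabulary = sorted(
    {tok for doc in documents for tok in tokenize(doc) if tok not in STOP_WORDS}
)
print(filtered_vocabulary)
```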

In [11]:
# English stop words built into scikit-learn
print(vectorizer.get_stop_words())

frozenset({'keep', 'more', 'although', 'though', 'together', 'had', 'throughout', 'towards', 'twelve', 'none', 'thereafter', 'third', 'system', 'on', 'between', 'thin', 'my', 'cant', 'found', 'himself', 'eg', 'often', 'still', 'hasnt', 'because', 'either', 'seemed', 'ten', 'yourself', 'him', 'beyond', 'he', 'namely', 'part', 'her', 'whenever', 'why', 'serious', 'among', 'a', 'becomes', 'few', 'latterly', 'you', 'name', 'perhaps', 'somewhere', 'thereupon', 'wherein', 'hereby', 'nevertheless', 'fire', 'someone', 'whatever', 'than', 'cannot', 'through', 'however', 'whom', 'front', 'show', 'couldnt', 'each', 'else', 'enough', 'hence', 'describe', 'alone', 'due', 'elsewhere', 'seeming', 'thereby', 'least', 'anyhow', 'already', 'ltd', 'above', 'noone', 'top', 'indeed', 'not', 'other', 'latter', 'un', 'being', 'whence', 'at', 'such', 'fifty', 'wherever', 'do', 'bottom', 'call', 'inc', 'last', 'which', 'one', 'themselves', 'within', 'should', 'onto', 'others', 'of', 'therefore', 'never', 'almo

In [12]:
# Vectors
f_array = vectors.toarray()
f_array

array([[1, 0, 0, 1, 2, 0],
       [0, 2, 0, 0, 1, 1],
       [0, 2, 1, 0, 0, 1]], dtype=int64)

Each word in the feature names corresponds to a column, and each value is the number of times that word occurs in the document for that row. There are three rows, one per document.
The next step is to compute the cosine distance between the vectors.

In [13]:
from scipy.spatial.distance import cosine
d12 = cosine(f_array[0], f_array[1])
d13 = cosine(f_array[0], f_array[2])
d23 = cosine(f_array[1], f_array[2])
print(d12, d13, d23)

0.6666666666666667 1.0 0.16666666666666663
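
Note that `scipy.spatial.distance.cosine` returns a distance, that is, one minus the cosine similarity, so 0 means the vectors point in the same direction and 1 means they share no terms. As a sanity check, the same numbers can be reproduced by hand. The function `cosine_distance` below is an illustrative helper written for this purpose, not SciPy's code.

```python
import math

def cosine_distance(u, v):
    # 1 - (u . v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# The count vectors printed above, one row per document.
f_array = [
    [1, 0, 0, 1, 2, 0],
    [0, 2, 0, 0, 1, 1],
    [0, 2, 1, 0, 0, 1],
]

d12 = cosine_distance(f_array[0], f_array[1])
d13 = cosine_distance(f_array[0], f_array[2])
d23 = cosine_distance(f_array[1], f_array[2])
print(d12, d13, d23)
```

The second and third documents are the closest pair (distance of about 0.17), while the first and third have no words in common after stop-word removal, which gives the maximum distance of 1 for non-negative vectors.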


## Binary vectors
You can also produce binary vectors, which record only whether a word is present in a document, not how many times it occurs.

In [14]:
vectorizer = CountVectorizer(analyzer='word', stop_words='english', binary=True)
vectors = vectorizer.fit_transform(documents)
vectorizer.get_feature_names()

['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']

In [15]:
vectors.toarray()

array([[1, 0, 0, 1, 1, 0],
       [0, 1, 0, 0, 1, 1],
       [0, 1, 1, 0, 0, 1]], dtype=int64)
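
For these documents, the binary matrix is what you get by replacing every non-zero count in the earlier count matrix with 1. A small plain-Python sketch of that relationship follows. The variable names are illustrative, and this is not how scikit-learn implements `binary=True` internally.

```python
# Count vectors from the stop-word-filtered vectorizer above.
counts = [
    [1, 0, 0, 1, 2, 0],
    [0, 2, 0, 0, 1, 1],
    [0, 2, 1, 0, 0, 1],
]

# Any non-zero count becomes 1; zeros stay 0.
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

for row in binary:
    print(row)
```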