# Word Embedding

## Bag of words

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
text = ['the man went out for a walk', 'the children sat around the fire']

In [3]:
vector = CountVectorizer()
vector.fit(text)
print('Vocabulary - Set of unique words in the corpus: ', str(vector.vocabulary_))

Vocabulary - Set of unique words in the corpus:  {'the': 7, 'man': 4, 'went': 9, 'out': 5, 'for': 3, 'walk': 8, 'children': 1, 'sat': 6, 'around': 0, 'fire': 2}


Notice that the word 'a' is missing from the vocabulary.

The scikit-learn documentation describes the token_pattern parameter of CountVectorizer as follows: "Regular expression denoting what constitutes a 'token', only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)."

The default pattern therefore ignores every single-character token, which is why 'a' is missing.
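To see the effect in isolation, the default pattern can be applied by hand with Python's `re` module. This is only a sketch of the tokenization step: the name `DEFAULT_TOKEN_PATTERN` is introduced here for illustration, and CountVectorizer also lowercases the text before tokenizing, which makes no difference for this sentence.

```python
import re

# Default token_pattern of CountVectorizer: two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sentence = 'the man went out for a walk'
tokens = re.findall(DEFAULT_TOKEN_PATTERN, sentence)

# The single-character word 'a' does not match the pattern, so it is dropped.
print(tokens)
```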

If you want single-character tokens to be in the vocabulary, you can supply a custom tokenizer.

See the CountVectorizer documentation linked below for reference.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html 

In [4]:
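from sklearn.feature_extraction.text import CountVectorizer
# Split on whitespace so that single-character tokens such as 'a' are kept.
vector = CountVectorizer(tokenizer=lambda doc: doc.split())
vector.fit(text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print('the feature names: the vocabulary: ', vector.get_feature_names_out().tolist())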

the feature names: the vocabulary:  ['a', 'around', 'children', 'fire', 'for', 'man', 'out', 'sat', 'the', 'walk', 'went']
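A custom tokenizer is not the only option. As a minimal sketch, the `token_pattern` parameter can be relaxed to accept one or more word characters. The variable name `vector_tp` is made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['the man went out for a walk', 'the children sat around the fire']

# One or more word characters, so single-character tokens such as 'a' are kept.
vector_tp = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vector_tp.fit(text)

# Sorted vocabulary terms; 'a' is now included.
print(sorted(vector_tp.vocabulary_))
```

Unlike `text.split()`, this pattern still treats punctuation as a separator, so a token such as 'walk.' would be stored as 'walk'. For the two sentences here, both approaches give the same 11 terms.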




In [5]:
print('The unique words in vocabulary with their assigned numbers: ', vector.vocabulary_)

The unique words in vocabulary with their assigned numbers:  {'the': 8, 'man': 5, 'went': 10, 'out': 6, 'for': 4, 'a': 0, 'walk': 9, 'children': 2, 'sat': 7, 'around': 1, 'fire': 3}


In [6]:
counts = vector.transform(text)
print(counts)

  (0, 0)	1
  (0, 4)	1
  (0, 5)	1
  (0, 6)	1
  (0, 8)	1
  (0, 9)	1
  (0, 10)	1
  (1, 1)	1
  (1, 2)	1
  (1, 3)	1
  (1, 7)	1
  (1, 8)	2


Each printed line has the form (row, column) value. The row index (0 or 1) identifies the document, documentA or documentB. The column index identifies a unique word through the vocabulary mapping printed above (Reference: {'the': 8, 'man': 5, 'went': 10, 'out': 6, 'for': 4, 'a': 0, 'walk': 9, 'children': 2, 'sat': 7, 'around': 1, 'fire': 3}).

The value after each pair is the number of times that term occurs in the document. E.g. the entry (1, 8) has the value 2 because the word 'the' appears twice in documentB.
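The sparse printout can be made easier to read by mapping each column index back to its word. The sketch below rebuilds the vectorizer so that it runs on its own. The names `index_to_word` and `decoded` are introduced only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['the man went out for a walk', 'the children sat around the fire']
vector = CountVectorizer(tokenizer=lambda doc: doc.split())
counts = vector.fit_transform(text)

# Invert the vocabulary: column index -> word.
index_to_word = {idx: word for word, idx in vector.vocabulary_.items()}

# COO format exposes the (row, column, value) triples directly.
coo = counts.tocoo()
decoded = [(int(r), index_to_word[int(c)], int(v))
           for r, c, v in zip(coo.row, coo.col, coo.data)]

for doc_id, word, count in decoded:
    print(f'document {doc_id}: {word!r} occurs {count} time(s)')
```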

In [7]:
print('The shape of the count is: '+str(counts.shape))

The shape of the count is: (2, 11)


Here 2 is the number of documents/sentences, and 11 is the number of unique words, i.e. the length of the vocabulary.

In [8]:
print('Printing the word vectors of size 11 dimensions for documentA and documentB\n', str(counts.toarray()))

Printing the word vectors of size 11 dimensions for documentA and documentB
 [[1 0 0 0 1 1 1 0 1 1 1]
 [0 1 1 1 0 0 0 1 2 0 0]]


## TF-IDF

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer


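TfidfVectorizer works directly on raw text: it builds the bag-of-words (BOW) counts and applies the TF-IDF weighting in one step. TfidfTransformer is used when the BOW count matrix already exists, as it does here.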
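The relationship between the two classes can be checked with a small sketch. It assumes both routes use the same whitespace tokenizer and the default TF-IDF settings. The helper `whitespace_tokenizer` and the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

text = ['the man went out for a walk', 'the children sat around the fire']

def whitespace_tokenizer(doc):
    """Split on whitespace so single-character tokens are kept."""
    return doc.split()

# Route 1: BOW counts first, then TF-IDF weighting of the counts.
bow = CountVectorizer(tokenizer=whitespace_tokenizer).fit_transform(text)
two_step = TfidfTransformer().fit_transform(bow)

# Route 2: raw text straight to TF-IDF.
one_step = TfidfVectorizer(tokenizer=whitespace_tokenizer).fit_transform(text)

# True when both routes produce the same matrix.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```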

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer
# TfidfTransformer learns the IDF weights from an existing count matrix.
vectorizer = TfidfTransformer()

# counts is the bag-of-words matrix produced by the CountVectorizer above.
vectorizer.fit(counts)
print('Learning frequencies of all features/terms/words: ', str(vectorizer.idf_))

Learning frequencies of all features/terms/words:  [1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.         1.40546511 1.40546511]
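The printed values are inverse document frequency (IDF) weights, one per vocabulary term. With the default `smooth_idf=True`, scikit-learn computes `idf = ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The NumPy sketch below reproduces the numbers from the count matrix. The array `counts_dense` is simply the dense matrix printed earlier, typed in by hand.

```python
import numpy as np

# Dense BOW counts for documentA and documentB (columns follow the vocabulary).
counts_dense = np.array([[1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
                         [0, 1, 1, 1, 0, 0, 0, 1, 2, 0, 0]])

n_docs = counts_dense.shape[0]
# Document frequency: in how many documents each term appears.
doc_freq = (counts_dense > 0).sum(axis=0)

# Smoothed IDF, as used by TfidfTransformer with smooth_idf=True.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(idf)
```

Only 'the' (index 8) appears in both documents, so it gets the lowest weight, ln(3/3) + 1 = 1. Every other term appears in one document and gets ln(3/2) + 1, about 1.405.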


In [11]:
vectorizer.idf_.shape

(11,)

In [12]:
freq = vectorizer.transform(counts)
print(freq)

  (0, 10)	0.3920440146223274
  (0, 9)	0.3920440146223274
  (0, 8)	0.2789425453258252
  (0, 6)	0.3920440146223274
  (0, 5)	0.3920440146223274
  (0, 4)	0.3920440146223274
  (0, 0)	0.3920440146223274
  (1, 8)	0.5797386715376657
  (1, 7)	0.40740123733358447
  (1, 3)	0.40740123733358447
  (1, 2)	0.40740123733358447
  (1, 1)	0.40740123733358447


In [13]:
print('Transforming the matrix based on the learnt frequencies or weights:\n\n', str(freq.toarray()))

Transforming the matrix based on the learnt frequencies or weights:

 [[0.39204401 0.         0.         0.         0.39204401 0.39204401
  0.39204401 0.         0.27894255 0.39204401 0.39204401]
 [0.         0.40740124 0.40740124 0.40740124 0.         0.
  0.         0.40740124 0.57973867 0.         0.        ]]
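Each row above is the count vector multiplied by the IDF weights and then scaled to unit length, since TfidfTransformer uses `norm='l2'` by default. The sketch below repeats that calculation with NumPy under those default settings. The arrays are typed in from the outputs above and the variable names are illustrative.

```python
import numpy as np

# Dense BOW counts for documentA and documentB.
counts_dense = np.array([[1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
                         [0, 1, 1, 1, 0, 0, 0, 1, 2, 0, 0]], dtype=float)

# IDF weights as printed by vectorizer.idf_: ln(3/2) + 1 everywhere,
# except 1.0 for 'the' (index 8), which occurs in both documents.
idf = np.full(11, np.log(1.5) + 1)
idf[8] = 1.0

# Term frequency times IDF, then L2-normalise each document row.
weighted = counts_dense * idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)
print(np.round(tfidf, 8))
```

Although 'the' has the lowest IDF, it still has the largest weight in documentB because it occurs twice there.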
