# Sample text data


I’m creating 4 sentences on which we’ll apply each of these techniques and understand how they work. For each of the techniques, I’ll use lowercase words only.

# CountVectorizer

First, we identify the unique words in the complete text data. Then, for each sentence, we create an array of zeros with the same length as the unique words vector and set each position to the number of times the corresponding word occurs in that sentence.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from itertools import chain

In [43]:
sentences = ['He is playing in the field',
             'He is running towards the football.',
             'The football game ended.',
             'It started raining while everyone was playing in the field.'
]

In [32]:
vectorizer = CountVectorizer()
sentence_vectors = vectorizer.fit_transform(sentences)

In [33]:
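# get_feature_names() was removed in scikit-learn 1.2; on newer versions call get_feature_names_out() instead
print(vectorizer.get_feature_names())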

['ended', 'everyone', 'field', 'football', 'game', 'he', 'in', 'is', 'it', 'playing', 'raining', 'running', 'started', 'the', 'towards', 'was', 'while']


In [27]:
sentence_vectors.toarray()

array([[0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0],
       [1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]], dtype=int64)

In [19]:
len(sentence_vectors.toarray()[0])

17
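
The two steps described at the start of this section can also be reproduced by hand. The following is a minimal sketch in plain Python, not scikit-learn's actual implementation. The helper name `tokenize` and the variable names are my own, and the regular expression is an assumption chosen to mirror CountVectorizer's default behaviour of lowercasing the text and keeping tokens of two or more word characters.

```python
import re

sentences = [
    "He is playing in the field",
    "He is running towards the football.",
    "The football game ended.",
    "It started raining while everyone was playing in the field.",
]

def tokenize(text):
    # Assumption: lowercase, then keep runs of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(sentence) for sentence in sentences]

# Step 1: collect the unique words and sort them alphabetically
vocabulary = sorted({word for tokens in tokenized for word in tokens})
index = {word: position for position, word in enumerate(vocabulary)}

# Step 2: start from an array of zeros per sentence and count each word
count_vectors = []
for tokens in tokenized:
    vector = [0] * len(vocabulary)
    for word in tokens:
        vector[index[word]] += 1
    count_vectors.append(vector)

print(len(vocabulary))   # 17
print(count_vectors[0])  # [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```

The first row matches the first row of the array returned by `sentence_vectors.toarray()` above.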

# TF-IDF Vectorizer

While CountVectorizer converts each sentence into its own vector, it does not consider the importance of a word across the complete list of sentences. For example, 'he' appears in two of the four sentences, so it provides no useful information for differentiating between those two. Thus, it should have a lower weight in the overall vector of the sentence. This is where the TF-IDF Vectorizer comes into the picture.

Consider the word 'he'

Total documents (N): 4

Documents in which the word appears (n): 2

Number of times the word appears in the first sentence: 1

Number of words in the first sentence: 6

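Term Frequency (TF) = 1 (the raw count; it is not divided by the number of words in the sentence)

Inverse Document Frequency (IDF) = log(N/n), using the natural logarithm

                                 = log(4/2)

                                 = log(2)

TF-IDF value = 1 * log(2)

             = 0.69314718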

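sklearn calculates the IDF in a different way. With smooth_idf = False, it adds 1 to the IDF, so the TF-IDF value for 'he' becomes 1 * (log(N/n) + 1) = 1 * (log(2) + 1) = 1.69314718.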
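
As a quick check of the arithmetic, the sklearn-style formula 1 * (log(N/n) + 1) can be evaluated directly. This is only a sketch of the unsmoothed, unnormalized variant discussed here. The function name `tfidf` is hypothetical, and the document counts for 'the' (all 4 sentences) and 'ended' (1 sentence) are read off the sample sentences.

```python
import math

N = 4  # total number of documents (sentences)

def tfidf(tf, n):
    # tf: raw count of the word in the sentence
    # n:  number of documents that contain the word
    return tf * (math.log(N / n) + 1)

print(round(tfidf(1, 2), 8))  # 'he'    -> 1.69314718
print(round(tfidf(1, 4), 8))  # 'the'   -> 1.0
print(round(tfidf(1, 1), 8))  # 'ended' -> 2.38629436
```

A word that appears in every sentence, such as 'the', gets the minimum weight of 1, while a word that appears in only one sentence gets the largest weight.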

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [34]:
sentences

['He is playing in the field',
 'He is running towards the football.',
 'The football game ended.',
 'It started raining while everyone was playing in the field.']

In [36]:
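# norm = None is the documented way to disable normalization; smooth_idf = False gives idf = log(N/n) + 1
vectorizer = TfidfVectorizer(norm = None, smooth_idf = False)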
sentence_vectors = vectorizer.fit_transform(sentences)
print(sentence_vectors.toarray())

[[0.         0.         1.69314718 0.         0.         1.69314718
  1.69314718 1.69314718 0.         1.69314718 0.         0.
  0.         1.         0.         0.         0.        ]
 [0.         0.         0.         1.69314718 0.         1.69314718
  0.         1.69314718 0.         0.         0.         2.38629436
  0.         1.         2.38629436 0.         0.        ]
 [2.38629436 0.         0.         1.69314718 2.38629436 0.
  0.         0.         0.         0.         0.         0.
  0.         1.         0.         0.         0.        ]
 [0.         2.38629436 1.69314718 0.         0.         0.
  1.69314718 0.         2.38629436 1.69314718 2.38629436 0.
  2.38629436 1.         0.         2.38629436 2.38629436]]


# Word2Vec

Word2Vec is a set of neural network models that aim to represent words in a vector space. These models are efficient to train and capture the context of words and the relations between them.

There are two models in this class:


- CBOW (Continuous Bag of Words): The neural network looks at the surrounding words (say 2 to the left and 2 to the right) and predicts the word that comes in between.
- Skip-gram: The neural network takes in a word and then tries to predict the surrounding words.
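
To make the difference between the two models concrete, here is a small sketch of how training examples could be framed for each one. It only illustrates the input/output pairing, not the neural network or its training, and it is not how gensim builds its examples internally. The function names `cbow_examples` and `skipgram_examples` are hypothetical.

```python
def cbow_examples(tokens, window=2):
    """Build (context_words, target_word) pairs: the context predicts the target."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]     # up to `window` words on the left
        right = tokens[i + 1:i + 1 + window]    # up to `window` words on the right
        examples.append((left + right, target))
    return examples

def skipgram_examples(tokens, window=2):
    """Build (input_word, context_word) pairs: the word predicts each neighbour."""
    examples = []
    for context, target in cbow_examples(tokens, window):
        examples.extend((target, neighbour) for neighbour in context)
    return examples

tokens = ["the", "football", "game", "ended"]
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)

print(cbow[1])       # (['the', 'game', 'ended'], 'football')
print(skipgram[:2])  # [('the', 'football'), ('the', 'game')]
```

CBOW produces one example per position with several input words, while skip-gram produces one example per (word, neighbour) pair.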

In [69]:
from gensim.models import word2vec

In [70]:
import string

stringIn = "string.with.punctuation!"
# Translation table that deletes every punctuation character
punct = str.maketrans("","",string.punctuation)

In [71]:
# Lowercase each sentence, split it on whitespace, then strip punctuation from every token
sentences_low = list(map(lambda x: x.lower(), sentences))
sentences_tok = list(map(lambda x: x.split(), sentences_low))
rem_punct = list(map(lambda x: [elem.translate(punct) for elem in x], sentences_tok))

In [73]:
rem_punct

[['he', 'is', 'playing', 'in', 'the', 'field'],
 ['he', 'is', 'running', 'towards', 'the', 'football'],
 ['the', 'football', 'game', 'ended'],
 ['it',
  'started',
  'raining',
  'while',
  'everyone',
  'was',
  'playing',
  'in',
  'the',
  'field']]

In [91]:
# gensim 3.x API: in gensim >= 4.0 the `size` parameter is named `vector_size`
# sg = 0 selects CBOW; sg = 1 would select skip-gram
model = word2vec.Word2Vec(rem_punct, workers = 1, size = 2, min_count = 1, window = 3, sg = 0)
similar_word = model.wv.most_similar('football')
print("Most similar word to football is: {}".format(similar_word[0]))

Most similar word to football is: ('raining', 0.9496791958808899)
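
The number returned next to the word is a cosine similarity between the two word vectors. The sketch below shows that measure on its own, under the assumption of made-up 2-dimensional vectors. The values of `a`, `b`, and `c` are hypothetical and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional word vectors, for illustration only
a = [1.0, 2.0]
b = [2.0, 4.0]   # points in the same direction as a
c = [-2.0, 1.0]  # perpendicular to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0.0: unrelated directions
```

A score near 1 means the two vectors point in almost the same direction, regardless of their lengths.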


In [88]:
# gensim 3.x API: in gensim >= 4.0 use model.wv.key_to_index instead
model.wv.vocab

{'he': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0f60>,
 'is': <gensim.models.keyedvectors.Vocab at 0x1e5f66e03c8>,
 'playing': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0da0>,
 'in': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0390>,
 'the': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0dd8>,
 'field': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0d68>,
 'running': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0400>,
 'towards': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0d30>,
 'football': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0e48>,
 'game': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0cf8>,
 'ended': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0438>,
 'it': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0c50>,
 'started': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0cc0>,
 'raining': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0c18>,
 'while': <gensim.models.keyedvectors.Vocab at 0x1e5f66e0c88>,
 'everyone': <gensim.models.keyedvectors.Vocab at 0x1e5f6