# Frequency-based Embeddings

## Count Vector Embedding

### references

[How areTF-IDF calculated by the scikit-learn TfidfVectorizer](https://stackoverflow.com/questions/36966019/how-aretf-idf-calculated-by-the-scikit-learn-tfidfvectorizer)

scikit-learn's __TfidfVectorizer__ computes TF-IDF in several steps: it __inherits from CountVectorizer__ and internally __uses TfidfTransformer__.

In summary, the steps are:

- term frequencies (tfs) are calculated by __CountVectorizer's fit_transform()__
- inverse document frequencies (idfs) are calculated by __TfidfTransformer's fit()__
- tf-idf values are calculated by __TfidfTransformer's transform()__

[How to Use Tfidftransformer & Tfidfvectorizer?](https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XkJQm1IzZTY)
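
As a quick check of the relationship described above, the following sketch compares the two-step route (CountVectorizer followed by TfidfTransformer) with TfidfVectorizer applied directly. It assumes default parameters for all three classes and an installed scikit-learn and NumPy; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Two-step route: raw counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both internally.
one_step = TfidfVectorizer().fit_transform(docs)

# With default settings the two matrices should agree.
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

The two-step route is mainly useful when the count matrix is needed on its own; otherwise TfidfVectorizer is the shorter path.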


### term frequencies using CountVectorizer's fit_transform()

In [54]:
# calculate term frequencies

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the house had a tiny little mouse",
      "the cat saw the mouse",
      "the mouse ran away from the house",
      "the cat finally ate the mouse",
      "the end of the mouse story"
     ]

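# build the document-term matrix of raw term counts

vectorizer = CountVectorizer()
vectorizer.fit(docs)  # learn the vocabulary
print(vectorizer.vocabulary_)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
# return a 'document-term matrix' holding the tfs (token counts);
# fit_transform() refits the vocabulary, so the fit() call above is only illustrative
word_count_vector = vectorizer.fit_transform(docs)
# print(word_count_vector)
print(word_count_vector.toarray())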

{'the': 14, 'house': 7, 'had': 6, 'tiny': 15, 'little': 8, 'mouse': 9, 'cat': 2, 'saw': 12, 'ran': 11, 'away': 1, 'from': 5, 'finally': 4, 'ate': 0, 'end': 3, 'of': 10, 'story': 13}
['ate', 'away', 'cat', 'end', 'finally', 'from', 'had', 'house', 'little', 'mouse', 'of', 'ran', 'saw', 'story', 'the', 'tiny']
[[0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1]
 [0 0 1 0 0 0 0 0 0 1 0 0 1 0 2 0]
 [0 1 0 0 0 1 0 1 0 1 0 1 0 0 2 0]
 [1 0 1 0 1 0 0 0 0 1 0 0 0 0 2 0]
 [0 0 0 1 0 0 0 0 0 1 1 0 0 1 2 0]]
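
To make the counting step less of a black box, here is a minimal pure-Python sketch that reproduces the vocabulary and the count matrix above. It assumes CountVectorizer's default behaviour of lowercasing the text and keeping only tokens of two or more word characters, which is why the one-letter word 'a' does not appear in the vocabulary. The helper name `tokenize` and the other variable names are made up for this illustration.

```python
import re
from collections import Counter

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Assumption: mirrors CountVectorizer's default tokenization
# (lowercase, tokens of at least two word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    """Split a document into lowercase tokens of length >= 2."""
    return TOKEN_PATTERN.findall(doc.lower())


tokenized = [tokenize(doc) for doc in docs]

# Columns are ordered alphabetically, one per distinct token.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one count per vocabulary term.
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```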


## TF-IDF Vector Embedding

### inverse document frequencies using TfidfTransformer's fit()

<font color='blue'>Note:</font>
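- For IDF, the log is the natural log (__base 'e'__).
- With __smooth_idf=False__, the idf of a term t is ln(n / df(t)) __+ 1__, where n is the number of documents and df(t) is the number of documents containing t. The +1 prevents the idf from becoming 0 for a word found in all documents; such a word gets an idf of 1 instead.
    - so, an idf value of __1.916__ is __ln(5/2) ≈ 0.916__ plus 1
        - The word 'cat' appears in 2 out of 5 documents, so its idf is ln(5/2) ≈ 0.916, bumped up to 1.916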

In [55]:
# calculate inverse document frequencies

transformer = TfidfTransformer(norm=None, smooth_idf=False)
transformer.fit(word_count_vector)  # fit() returns the fitted transformer itself
print(transformer.idf_)

[2.60943791 2.60943791 1.91629073 2.60943791 2.60943791 2.60943791
 2.60943791 1.91629073 2.60943791 1.         2.60943791 2.60943791
 2.60943791 2.60943791 1.         2.60943791]
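
The idf values printed above can be checked by hand. The sketch below applies the unsmoothed formula ln(n / df) + 1 to a few terms, and also shows the smoothed variant ln((1 + n) / (1 + df)) + 1 that scikit-learn uses when smooth_idf is left at its default of True. The document frequencies are read off the count matrix; the function and dictionary names are hypothetical.

```python
import math

n_docs = 5

# Number of documents containing each term, read off the count matrix.
doc_freq = {"cat": 2, "house": 2, "mouse": 5, "the": 5, "tiny": 1}


def idf_unsmoothed(df, n):
    """idf as computed with smooth_idf=False: ln(n / df) + 1."""
    return math.log(n / df) + 1


def idf_smoothed(df, n):
    """idf as computed with smooth_idf=True: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1


for term, df in doc_freq.items():
    print(
        f"{term:6s} df={df} "
        f"unsmoothed={idf_unsmoothed(df, n_docs):.8f} "
        f"smoothed={idf_smoothed(df, n_docs):.8f}"
    )
```

Terms that occur in every document ('the', 'mouse') get an idf of exactly 1 under both variants, so they are never weighted down to zero.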


### tf-idf using TfidfTransformer's transform()

In [56]:
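# calculate tf-idf (term frequency times inverse document frequency)

# reuse the transformer fitted above (norm=None, smooth_idf=False), so each
# value is the raw count multiplied by the idf, with no normalization
tf_idf = transformer.transform(word_count_vector)
print(tf_idf.toarray())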

[[0.         0.         0.         0.         0.         0.
  2.60943791 1.91629073 2.60943791 1.         0.         0.
  0.         0.         1.         2.60943791]
 [0.         0.         1.91629073 0.         0.         0.
  0.         0.         0.         1.         0.         0.
  2.60943791 0.         2.         0.        ]
 [0.         2.60943791 0.         0.         0.         2.60943791
  0.         1.91629073 0.         1.         0.         2.60943791
  0.         0.         2.         0.        ]
 [2.60943791 0.         1.91629073 0.         2.60943791 0.
  0.         0.         0.         1.         0.         0.
  0.         0.         2.         0.        ]
 [0.         0.         0.         2.60943791 0.         0.
  0.         0.         0.         1.         2.60943791 0.
  0.         2.60943791 2.         0.        ]]
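
Because the transformer was created with norm=None, each entry above is simply the term count multiplied by the term's idf. The following sketch recomputes the second row ("the cat saw the mouse") by hand and then applies the L2 normalization that TfidfTransformer would perform with its default norm='l2'. The dictionary names are illustrative, and the idf values are recomputed from the unsmoothed formula rather than taken from the library.

```python
import math

# Second document: "the cat saw the mouse".
tf = {"cat": 1, "mouse": 1, "saw": 1, "the": 2}

# Unsmoothed idf, ln(n / df) + 1, with n = 5 documents.
idf = {
    "cat": math.log(5 / 2) + 1,    # in 2 of 5 documents
    "mouse": math.log(5 / 5) + 1,  # in all 5 documents
    "saw": math.log(5 / 1) + 1,    # in 1 of 5 documents
    "the": math.log(5 / 5) + 1,    # in all 5 documents
}

# Unnormalized tf-idf: count times idf.
tf_idf = {term: tf[term] * idf[term] for term in tf}

# Optional L2 normalization: scale the row to unit Euclidean length.
row_norm = math.sqrt(sum(value * value for value in tf_idf.values()))
tf_idf_l2 = {term: value / row_norm for term, value in tf_idf.items()}

for term in tf:
    print(f"{term:6s} tf-idf={tf_idf[term]:.8f} l2-normalized={tf_idf_l2[term]:.8f}")
```

Note how 'the' still scores 2.0 in the unnormalized row purely because it occurs twice; the idf of 1 leaves frequent words unpenalized rather than removing them.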


## Co-Occurrence Vector Embedding

# Prediction-based Embeddings