## TfidfVectorizer vs. TfidfTransformer
[Comparison notebook: TfidfTransformer vs. TfidfVectorizer](https://github.com/kavgan/nlp-in-practice/blob/master/tfidftransformer/TFIDFTransformer%20vs.%20TFIDFVectorizer.ipynb)

[TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)


`TfidfVectorizer` converts a collection of raw documents to a matrix of TF-IDF features.

It is equivalent to `CountVectorizer` followed by `TfidfTransformer`.
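
The equivalence can be checked directly. The following is a minimal sketch, assuming default parameters on all three classes; `sample_docs` is a made-up two-line corpus used only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# hypothetical mini corpus, only for this check
sample_docs = ["the cat saw the mouse", "the mouse ran away from the house"]

# one step: raw text -> TF-IDF matrix
one_step = TfidfVectorizer().fit_transform(sample_docs)

# two steps: raw text -> term counts -> TF-IDF matrix
counts = CountVectorizer().fit_transform(sample_docs)
two_step = TfidfTransformer().fit_transform(counts)

# both routes should give the same weights
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

The two-step route is mainly useful when the raw counts are needed for something else as well; otherwise the one-step vectorizer is shorter.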


## TFIDF VECTORIZER

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf(doc_path):
    """Return the TF-IDF scores of the first line of a text file, highest first."""
    vectorizer = TfidfVectorizer(use_idf=True)

    # each line of the file is treated as one document
    with open(doc_path, "r") as f:
        corpus = f.readlines()

    # sparse matrix of shape (n_documents, n_terms)
    tfidf = vectorizer.fit_transform(corpus)

    # get the TF-IDF vector of the first document
    first_vector_tfidf = tfidf[0]

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    df = pd.DataFrame(first_vector_tfidf.T.todense(), index=vectorizer.get_feature_names_out(), columns=["tfidf"])
    return df.sort_values(by=["tfidf"], ascending=False)

    
# get_tf_idf("Data/oldmanandthesea.txt")

In [2]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# a toy corpus, only meant to show how the two APIs differ in usage
docs=["the house had a tiny little mouse",
      "the cat saw the mouse",
      "the mouse ran away from the house",
      "the cat finally ate the mouse",
      "the end of the mouse story"
     ]

## TFIDF TRANSFORMER

`cv.fit_transform(raw_documents)` 
- learn the vocabulary dictionary and return document-term matrix
- equivalent to fit followed by transform but more efficient

`cv.transform(raw_documents)` 
- transform docs to document-term matrix
- extract token counts out of raw text docs using vocab fitted with fit or one provided in constructor

`tfidf_transformer.fit(word_count_vector)`
- learn the IDF weight of every term from a document-term count matrix (not from raw documents)

`tfidf_transformer.fit_transform(word_count_vector)`
- learn the IDF weights and return the TF-IDF-weighted document-term matrix
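
A short sketch of how these calls fit together, assuming default parameters. `sample_docs` is a hypothetical corpus; note that `CountVectorizer` takes raw text, while `TfidfTransformer` takes the count matrix that `CountVectorizer` produces.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# hypothetical mini corpus, only for illustration
sample_docs = ["the cat saw the mouse", "the end of the mouse story"]

cv = CountVectorizer()
counts_a = cv.fit_transform(sample_docs)               # learn vocabulary and count in one call
counts_b = cv.fit(sample_docs).transform(sample_docs)  # same result in two calls

tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(counts_a)                  # learns idf_ from the counts
weights = tfidf_transformer.transform(counts_a)  # counts -> TF-IDF weights

# fit_transform and fit + transform agree
print(np.array_equal(counts_a.toarray(), counts_b.toarray()))
# one IDF weight per vocabulary term
print(len(tfidf_transformer.idf_) == counts_a.shape[1])
```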

In [3]:
#init Count Vectorizer()
cv = CountVectorizer()

#generates word counts for words in docs
word_count_vector=cv.fit_transform(docs)

# -- COMPUTE IDFS --
#init tfidf transformer
tfidf_transformer=TfidfTransformer(smooth_idf=True, use_idf=True)

tfidf_transformer.fit(word_count_vector)

#print idf vals
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"])
df_idf.sort_values(by=['idf_weights'])


| | idf_weights |
|---|---|
| mouse | 1.0 |
| the | 1.0 |
| cat | 1.693147 |
| house | 1.693147 |
| ate | 2.098612 |
| away | 2.098612 |
| end | 2.098612 |
| finally | 2.098612 |
| from | 2.098612 |
| had | 2.098612 |
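
These weights can be reproduced by hand. With `smooth_idf=True`, scikit-learn documents the weight of a term as `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The sketch below recomputes that on the same five documents; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the house had a tiny little mouse",
        "the cat saw the mouse",
        "the mouse ran away from the house",
        "the cat finally ate the mouse",
        "the end of the mouse story"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)

n_docs = counts.shape[0]
# document frequency: in how many documents does each term occur?
doc_freq = (counts.toarray() > 0).sum(axis=0)

# smoothed IDF, as documented for smooth_idf=True
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

sklearn_idf = TfidfTransformer(smooth_idf=True, use_idf=True).fit(counts).idf_
print(np.allclose(manual_idf, sklearn_idf))
```

"mouse" and "the" occur in all five documents, so their weight is `ln(6/6) + 1 = 1.0`; "cat" occurs in two, giving `ln(6/3) + 1 ≈ 1.693147`.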


### Compute the TFIDF score for documents
Now compute the TF-IDF scores for all 5 documents.

In practice, the IDF weights should be learned from a much larger corpus than this toy example.
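
One way to apply this is to fit the vectorizer once on a reference corpus and then only call `transform` on new text. The sketch below uses the five toy documents as a stand-in for a large corpus; `reference_corpus` and `new_docs` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# stand-in for a large reference corpus
reference_corpus = ["the house had a tiny little mouse",
                    "the cat saw the mouse",
                    "the mouse ran away from the house",
                    "the cat finally ate the mouse",
                    "the end of the mouse story"]

# unseen text; "dog" never occurs in the reference corpus
new_docs = ["the cat ran away from the dog"]

# vocabulary and IDF weights come from the reference corpus only
vectorizer = TfidfVectorizer().fit(reference_corpus)

# transform() reuses the learned weights; unknown words are ignored
new_tfidf = vectorizer.transform(new_docs)
print(new_tfidf.shape)
print("dog" in vectorizer.vocabulary_)
```

Because `transform` never updates the vocabulary, words missing from the reference corpus contribute nothing to the new vector.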

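`.T` is the transpose ([numpy.ndarray.T docs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.T.html)). Here it turns the 1 × n row vector of one document into an n × 1 column, so that each term becomes one row of the DataFrame.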
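
A small sketch of what the transpose does to the shapes. It assumes a made-up `sample_docs` corpus and uses the slice `[0:1]` so the row stays two-dimensional regardless of the sparse type returned; `.toarray()` is used here as a plain-ndarray alternative to `.todense()`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical mini corpus, only for illustration
sample_docs = ["the cat saw the mouse", "the end of the mouse story"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sample_docs)

row = tfidf[0:1]                  # sparse, shape (1, n_terms)
column = row.T                    # sparse, shape (n_terms, 1)
dense_column = column.toarray()   # dense ndarray, one row per term

n_terms = len(vectorizer.vocabulary_)
print(row.shape, column.shape, dense_column.shape)
```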

In [17]:
count_vector = cv.transform(docs)

tf_idf_vector = tfidf_transformer.transform(count_vector)

#get tfidf vector for first document
first_document_vector=tf_idf_vector[0]

# print('FIRST DOC: \n', first_document_vector)
# print('\n FIRST DOC .T (transpose): \n', first_document_vector.T)
# print('\n .todense(): \n', first_document_vector.T.todense())

# print the scores
df = pd.DataFrame(first_document_vector.T.todense(), index=cv.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=['tfidf'], ascending=False)

| | tfidf |
|---|---|
| had | 0.493562 |
| little | 0.493562 |
| tiny | 0.493562 |
| house | 0.398203 |
| mouse | 0.235185 |
| the | 0.235185 |
| ate | 0.0 |
| away | 0.0 |
| cat | 0.0 |
| end | 0.0 |
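
The scores for the first document can be recomputed by hand as a check. This sketch assumes the `TfidfTransformer` defaults (`norm="l2"`, `sublinear_tf=False`): multiply each term count by its IDF weight, then divide the row by its Euclidean length. Variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the house had a tiny little mouse",
        "the cat saw the mouse",
        "the mouse ran away from the house",
        "the cat finally ate the mouse",
        "the end of the mouse story"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)
transformer = TfidfTransformer(smooth_idf=True, use_idf=True).fit(counts)

# term counts of the first document times the learned IDF weights
raw = counts[0:1].toarray().ravel() * transformer.idf_

# L2-normalise so the vector has length 1
manual = raw / np.linalg.norm(raw)

sklearn_row = transformer.transform(counts[0:1]).toarray().ravel()
print(np.allclose(manual, sklearn_row))
print(round(manual[cv.vocabulary_["had"]], 6))
```

Rare terms such as "had" end up with the highest score, while "the" and "mouse", which occur in every document, get the lowest non-zero score.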
