# TF-IDF

### TF-IDF (term frequency-inverse document frequency)
TF-IDF is a numerical statistic that reflects how important a word is to a document within a collection (corpus). It is often used as a weighting factor in information retrieval, text mining, and user modeling.
* It is used in information retrieval (IR) and machine learning to quantify the importance or relevance of string representations (words, phrases, lemmas, etc.) in a document relative to a collection of documents (a corpus).
* TF-IDF can be broken down into two parts: TF (term frequency) and IDF (inverse document frequency).
    * TF (term frequency)
        * How often a particular term occurs in a document, relative to that document.
    * IDF (inverse document frequency)
        * How common (or uncommon) a term is across the corpus.
* TF tells us how often a term appears in a document; IDF tells us how rare the term is across the collection. Multiplying the two gives the final TF-IDF value, which is high for a term that is frequent in one document but rare across the corpus, and low for a term that appears in most documents.
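
The two parts can be sketched with the standard library alone. This is a minimal illustration that assumes the plain textbook definitions, tf = count / document length and idf = ln(N / df). scikit-learn, used later in this notebook, applies a smoothed and normalized variant, so its numbers differ. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def term_frequency(term, doc_tokens):
    """Share of the document's tokens that are `term`."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def inverse_document_frequency(term, corpus):
    """ln(N / df): N documents in total, df of them contain `term`."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus):
    """Product of the two parts (unsmoothed textbook variant)."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)

# Toy corpus of already-tokenized documents (illustrative only).
corpus = [
    ["great", "movie", "town"],
    ["song", "music"],
    ["movie", "fun", "movie"],
]

print(tf_idf("town", corpus[0], corpus))   # term found in one document only
print(tf_idf("movie", corpus[0], corpus))  # term found in two of three documents
```

Both terms occur once in the first document, yet "town" scores higher because it is rarer across the corpus.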

#  Token Preparation
## Documents: for input


In [1]:
doc1='''Gangs of Wasseypur is a great movie. Wasseypur is a town in Bihar.'''
doc2='''The Success of a song depends on the music. '''
doc3='''There is a movie releasing this week. The movie is fun to watch.'''
collectdoc=[doc1,doc2,doc3] #document list


### Casefold the given documents

In [3]:
collectdoc_casefold=[]
for doc in collectdoc:
    collectdoc_casefold.append(doc.casefold())
print(collectdoc_casefold)

['gangs of wasseypur is a great movie. wasseypur is a town in bihar.', 'the success of a song depends on the music. ', 'there is a movie releasing this week. the movie is fun to watch.']


### Tokenize

In [19]:
import nltk

In [4]:
from nltk.tokenize import word_tokenize

In [11]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jazee\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [5]:
collectdoc_token=[]
for doc in collectdoc_casefold:
    collectdoc_token.append(word_tokenize(doc))
print(collectdoc_token)

[['gangs', 'of', 'wasseypur', 'is', 'a', 'great', 'movie', '.', 'wasseypur', 'is', 'a', 'town', 'in', 'bihar', '.'], ['the', 'success', 'of', 'a', 'song', 'depends', 'on', 'the', 'music', '.'], ['there', 'is', 'a', 'movie', 'releasing', 'this', 'week', '.', 'the', 'movie', 'is', 'fun', 'to', 'watch', '.']]


## Removal of stopwords

In [6]:
from nltk.corpus import stopwords

In [22]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jazee\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [7]:
stop_words = set(stopwords.words('english'))
punctuation = {",", ".", ";", ":"}  # compare whole tokens, not substrings of ",.;:"

collectdoc_filtered = []
for doc in collectdoc_token:
    # keep tokens that are neither stopwords nor punctuation
    filtered_doc = [word for word in doc if word not in stop_words and word not in punctuation]
    collectdoc_filtered.append(filtered_doc)
print(collectdoc_filtered)

[['gangs', 'wasseypur', 'great', 'movie', 'wasseypur', 'town', 'bihar'], ['success', 'song', 'depends', 'music'], ['movie', 'releasing', 'week', 'movie', 'fun', 'watch']]


## Stemming

In [8]:
from nltk.stem import PorterStemmer

In [9]:
ps = PorterStemmer() #removing the commoner morphological and inflexional endings from words in English
collectdoc_stemmed = []
for doc in collectdoc_filtered:
    stem_doc = []
    for word in doc:
        stem_doc.append(ps.stem(word))
    collectdoc_stemmed.append(stem_doc)
print(collectdoc_stemmed)

[['gang', 'wasseypur', 'great', 'movi', 'wasseypur', 'town', 'bihar'], ['success', 'song', 'depend', 'music'], ['movi', 'releas', 'week', 'movi', 'fun', 'watch']]


# TF-IDF Vectorization

## Frequency Count

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
# Scikit-learn's CountVectorizer converts a collection of text documents to a matrix of term/token counts

In [11]:
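def empty(doc):
    # Identity function: the documents are already tokenized and stemmed,
    # so CountVectorizer should not preprocess or tokenize them again.
    return doc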

In [15]:
count_vector = CountVectorizer(tokenizer=empty,preprocessor=empty)
word_count_vector = count_vector.fit_transform(collectdoc_stemmed)

In [16]:
word_count_vector.toarray()  # term counts: one row per document, one column per term

array([[1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 2, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 0, 1, 1]], dtype=int64)

In [17]:
count_vector.get_feature_names_out().tolist()  # terms in column order (scikit-learn >= 1.0; older versions: get_feature_names())

['bihar',
 'depend',
 'fun',
 'gang',
 'great',
 'movi',
 'music',
 'releas',
 'song',
 'success',
 'town',
 'wasseypur',
 'watch',
 'week']
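
To see what the count matrix contains, the same table can be rebuilt with the standard library. This is only a sketch of the idea, not how CountVectorizer is implemented. It assumes the stemmed token lists from the Stemming step and an alphabetically sorted vocabulary, which matches the feature order printed above.

```python
from collections import Counter

# Stemmed documents, copied from the output of the Stemming step.
docs = [
    ["gang", "wasseypur", "great", "movi", "wasseypur", "town", "bihar"],
    ["success", "song", "depend", "music"],
    ["movi", "releas", "week", "movi", "fun", "watch"],
]

# Alphabetically sorted vocabulary: one column per distinct term.
vocabulary = sorted({token for doc in docs for token in doc})

# One row per document, holding the count of each vocabulary term.
counts = []
for doc in docs:
    tally = Counter(doc)
    counts.append([tally[term] for term in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```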

## Tf-idf Transformer Initialization

### Convert frequency of words matrix to Tf-idf matrix


In [18]:
from sklearn.feature_extraction.text import TfidfTransformer
# TfidfTransformer takes the count matrix produced by CountVectorizer,
# computes the IDF weights from it, and then computes the Tf-idf scores

In [19]:
tfidf_trans = TfidfTransformer(smooth_idf=True,use_idf=True)

In [20]:
tfidf_trans_fit = tfidf_trans.fit(word_count_vector)

#### Weights for IDF 

In [21]:
tfidf_trans.idf_  # one IDF weight per term, shared by all documents; 'movi' is lower because it occurs in two documents

array([1.69314718, 1.69314718, 1.69314718, 1.69314718, 1.69314718,
       1.28768207, 1.69314718, 1.69314718, 1.69314718, 1.69314718,
       1.69314718, 1.69314718, 1.69314718, 1.69314718])
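
The two distinct values above can be checked by hand. With `smooth_idf=True`, scikit-learn documents the weight as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term. The helper name `smooth_idf` below is hypothetical; the sketch uses only the standard library.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Three documents in the corpus.
print(round(smooth_idf(3, 1), 8))  # a term found in one document
print(round(smooth_idf(3, 2), 8))  # 'movi', found in two documents
```

Every term except 'movi' appears in exactly one document, which is why thirteen of the fourteen weights are identical.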

## TF-IDF Vector

In [22]:
tf_idf_vector=tfidf_trans_fit.transform(word_count_vector)

In [23]:
tf_idf_vector.toarray() #returns an ndarray

array([[0.34142622, 0.        , 0.        , 0.34142622, 0.34142622,
        0.25966344, 0.        , 0.        , 0.        , 0.        ,
        0.34142622, 0.68285244, 0.        , 0.        ],
       [0.        , 0.5       , 0.        , 0.        , 0.        ,
        0.        , 0.5       , 0.        , 0.5       , 0.5       ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.39798027, 0.        , 0.        ,
        0.60534851, 0.        , 0.39798027, 0.        , 0.        ,
        0.        , 0.        , 0.39798027, 0.39798027]])
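
The rows above are not raw count × IDF products: by default TfidfTransformer also scales each row to unit Euclidean length (`norm='l2'`). A hand calculation for the third document is sketched below with the standard library; the helper names are hypothetical, and the counts and document frequencies are read off the earlier outputs.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def l2_normalize(weights):
    """Scale a list of weights so its Euclidean norm is 1."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

# doc3 after stemming: 'movi' occurs twice (df = 2), the other terms once (df = 1).
raw = {
    "fun": 1 * smooth_idf(3, 1),
    "movi": 2 * smooth_idf(3, 2),
    "releas": 1 * smooth_idf(3, 1),
    "watch": 1 * smooth_idf(3, 1),
    "week": 1 * smooth_idf(3, 1),
}

normalized = dict(zip(raw, l2_normalize(list(raw.values()))))
for term, weight in normalized.items():
    print(term, round(weight, 8))
```

The same reasoning explains the second row: four terms with equal raw weights normalize to 0.5 each.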

## Creating DataFrames using Pandas

In [54]:
import pandas as pd

In [73]:
feature_names = count_vector.get_feature_names_out()  # scikit-learn >= 1.0; older versions: get_feature_names()

def ind(n):
    """Build the row labels 'tf-idf of doc1' ... 'tf-idf of docN'."""
    return ['tf-idf of doc' + str(i) for i in range(1, n + 1)]

finalresult = pd.DataFrame(tf_idf_vector.toarray(), index=ind(len(collectdoc)), columns=feature_names)
finalresult

| | bihar | depend | fun | gang | great | movi | music | releas | song | success | town | wasseypur | watch | week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tf-idf of doc1 | 0.341426 | 0.0 | 0.0 | 0.341426 | 0.341426 | 0.259663 | 0.0 | 0.0 | 0.0 | 0.0 | 0.341426 | 0.682852 | 0.0 | 0.0 |
| tf-idf of doc2 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 |
| tf-idf of doc3 | 0.0 | 0.0 | 0.39798 | 0.0 | 0.0 | 0.605349 | 0.0 | 0.39798 | 0.0 | 0.0 | 0.0 | 0.0 | 0.39798 | 0.39798 |
