<p style="font-family:Roboto; font-size: 26px; color: magenta"> 1.3 - Understanding TF-IDF (Term Frequency-Inverse Document Frequency)</p>

<p style="font-family:Consolas; font-size: 18px; color: lightgreen"> Converting Text into Vectors with TF-IDF: Example</p>

<p style="font-family:Consolas; font-size: 18px; color: lightgreen"> Imagine we have a corpus (a collection of documents) with three documents:</p>
<p style="font-family:Consolas; font-size: 18px; color: lightgreen"> 1. Document 1: "The cat sat on the mat."</p>
<p style="font-family:Consolas; font-size: 18px; color: lightgreen"> 2. Document 2: "The dog played in the park."</p>
<p style="font-family:Consolas; font-size: 18px; color: lightgreen"> 3. Document 3: "Cats and dogs are great pets."</p>

<p style="font-family:Roboto; font-size: 26px; color: magenta"> Step 1: Calculate Term Frequency (TF)</p>

> For Document 1:

* The word "cat" appears 1 time.
* The total number of terms in Document 1 is 6 ("the", "cat", "sat", "on", "the", "mat").
* So, TF(cat,Document 1) = 1/6

> For Document 2:

* The word "cat" does not appear.
* So, TF(cat,Document 2)=0.

> For Document 3:

* The word "cat" appears 1 time (as "cats", assuming plurals are reduced to their singular form, for example by stemming or lemmatization).
* The total number of terms in Document 3 is 6 ("cats", "and", "dogs", "are", "great", "pets").
* So, TF(cat,Document 3)=1/6

In [None]:
# In Document 1 and Document 3, the word "cat" has the same TF score. 
# This means it appears with the same relative frequency in both documents.

# In Document 2, the TF score is 0 because the word "cat" does not appear.
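
The TF values from Step 1 can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper names `tokenize`, `normalize` and `term_frequency` are hypothetical, and the plural rule in `normalize` is a deliberately naive assumption that exists only so that "cats" is counted as "cat", as in the walkthrough above.

```python
import re

documents = [
    "The cat sat on the mat.",
    "The dog played in the park.",
    "Cats and dogs are great pets.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def normalize(token):
    # Assumption: naive plural stripping, so "cats" is counted as "cat".
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def term_frequency(term, text):
    """TF = occurrences of the term / total number of terms in the document."""
    tokens = [normalize(t) for t in tokenize(text)]
    return tokens.count(term) / len(tokens)

for i, doc in enumerate(documents, start=1):
    print(f"TF(cat, Document {i}) = {term_frequency('cat', doc):.4f}")
```

A real pipeline would normally use a proper stemmer or lemmatizer instead of the one-line plural rule.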

<p style="font-family:Roboto; font-size: 26px; color: magenta"> Step 2: Calculate Inverse Document Frequency (IDF)</p>

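* Total number of documents in the corpus (D): 3.
* Number of documents containing the term "cat": 2 (Document 1 and Document 3).
* So, IDF(cat) = log10(D / number of documents containing "cat") = log10(3/2) ≈ 0.176, using the base-10 logarithm.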

In [1]:
# The IDF score for "cat" is relatively low. This indicates that the word "cat" is not very rare
#  in the corpus: it appears in 2 out of 3 documents.

# If a term appeared in only 1 document, its IDF score would be higher, indicating greater uniqueness.
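
The comparison above can be checked numerically. The sketch below assumes the textbook definition IDF = log10(D / df), where df is the number of documents containing the term. The base-10 logarithm is an assumption, chosen because it is the base that matches the 0.029 score quoted in Step 3. The function name is hypothetical.

```python
import math

def inverse_document_frequency(total_docs, docs_with_term):
    """Textbook IDF, assuming a base-10 logarithm and no smoothing."""
    return math.log10(total_docs / docs_with_term)

idf_cat = inverse_document_frequency(3, 2)   # "cat" is in 2 of 3 documents
idf_rare = inverse_document_frequency(3, 1)  # a term found in only 1 document

print(f"IDF(cat) = {idf_cat:.3f}")
print(f"IDF(term in 1 document) = {idf_rare:.3f}")
```

The rarer term gets the larger IDF, which is the behaviour the comments above describe.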

<p style="font-family:Roboto; font-size: 26px; color: magenta"> Step 3: Calculate TF-IDF</p>

The TF-IDF score for "cat" is TF × IDF = (1/6) × log10(3/2) ≈ 0.029 in Document 1 and Document 3, and 0 in Document 2. The score reflects both the frequency of the term in the document (TF) and its rarity across the corpus (IDF).
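
As a quick check of the arithmetic, the following sketch multiplies the TF values from Step 1 by the IDF of "cat". It assumes the base-10 logarithm, which is the base that reproduces the 0.029 figure. The variable names are illustrative only.

```python
import math

# TF values for "cat" taken from Step 1
tf = {"Document 1": 1 / 6, "Document 2": 0.0, "Document 3": 1 / 6}

# IDF for "cat": 3 documents in total, 2 of them contain the term
idf_cat = math.log10(3 / 2)

# TF-IDF is the product of the two quantities
tfidf_cat = {doc: value * idf_cat for doc, value in tf.items()}

for doc, score in tfidf_cat.items():
    print(f"TF-IDF(cat, {doc}) = {score:.3f}")
```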

<p style="font-family:Roboto; font-size: 26px; color: magenta"> Why is TF-IDF Useful in This Example?</p>

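* 1. Identifying Important Terms: "cat" scores above zero only in Document 1 and Document 3, so the score points to the documents in which the term actually occurs.
* 2. Filtering Common Words: a word that appears in all three documents has IDF = log(3/3) = 0, so its TF-IDF is 0 no matter how often it occurs.
* 3. Highlighting Unique Terms: a word found in only one document, such as "mat", gets the highest IDF, log(3/1), and therefore stands out in that document.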

<p style="font-family:Roboto; font-size: 26px; color: magenta"> Implementing TF-IDF in Sklearn with Python</p>

In [2]:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
# Collect strings from documents and create a corpus having a collection of strings from the documents d0, d1, and d2.
# assign documents
d0 = 'Geeks for geeks'
d1 = 'Geeks'
d2 = 'r2j'

# merge documents into a single corpus
string = [d0, d1, d2]

In [10]:
# Get tf-idf values from fit_transform() method.
# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)


In [11]:
# get idf values
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)


idf values:
for : 1.6931471805599454
geeks : 1.2876820724517808
r2j : 1.6931471805599454
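
These values differ from the textbook log(D / df) because `TfidfVectorizer` smooths the IDF by default (`smooth_idf=True`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, with the natural logarithm. The sketch below reproduces the printed values by hand. The document frequencies are read off the three-document corpus above, and the function name `smoothed_idf` is hypothetical.

```python
import math

n_documents = 3
# Number of documents that contain each term (after lowercasing)
document_frequency = {"for": 1, "geeks": 2, "r2j": 1}

def smoothed_idf(n_docs, df):
    """Smoothed IDF as used by scikit-learn when smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + df)) + 1

for term, df in document_frequency.items():
    print(term, ":", smoothed_idf(n_documents, df))
```

The trailing `+ 1` means that a term occurring in every document still gets an IDF of 1 rather than 0, so it is not dropped entirely.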


In [12]:
# Display tf-idf values along with indexing.
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf value:')
print(result)

# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())


Word indexes:
{'geeks': 1, 'for': 0, 'r2j': 2}

tf-idf value:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4 stored elements and shape (3, 3)>
  Coords	Values
  (0, 1)	0.8355915419449176
  (0, 0)	0.5493512310263033
  (1, 1)	1.0
  (2, 2)	1.0

tf-idf values in matrix form:
[[0.54935123 0.83559154 0.        ]
 [0.         1.         0.        ]
 [0.         0.         1.        ]]
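
The entries are not plain count × idf products because `TfidfVectorizer` also scales each row to unit Euclidean length by default (`norm='l2'`). A minimal sketch of the arithmetic for the first document, 'Geeks for geeks', follows. The idf values use the smoothed formula ln((1 + n) / (1 + df)) + 1, and the variable names are illustrative.

```python
import math

# Smoothed idf values for the 3-document corpus
idf = {"for": math.log(4 / 2) + 1, "geeks": math.log(4 / 3) + 1}

# Raw term counts in 'Geeks for geeks' after lowercasing
counts = {"for": 1, "geeks": 2}

# Unnormalized tf-idf: count * idf
raw = {term: counts[term] * idf[term] for term in counts}

# L2 normalization: divide by the Euclidean length of the row
length = math.sqrt(sum(value * value for value in raw.values()))
row = {term: value / length for term, value in raw.items()}

print(row)
```

Rows that contain a single term, such as the second and third documents, always normalize to 1.0.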


<p style="font-family:Roboto; font-size: 26px; color: magenta"> Example 1: Below is the complete program based on the above approach:</p>

In [13]:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents
d0 = 'Geeks for geeks'
d1 = 'Geeks'
d2 = 'r2j'

# merge documents into a single corpus
string = [d0, d1, d2]

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

# get idf values
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf value:')
print(result)

# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())


idf values:
for : 1.6931471805599454
geeks : 1.2876820724517808
r2j : 1.6931471805599454

Word indexes:
{'geeks': 1, 'for': 0, 'r2j': 2}

tf-idf value:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4 stored elements and shape (3, 3)>
  Coords	Values
  (0, 1)	0.8355915419449176
  (0, 0)	0.5493512310263033
  (1, 1)	1.0
  (2, 2)	1.0

tf-idf values in matrix form:
[[0.54935123 0.83559154 0.        ]
 [0.         1.         0.        ]
 [0.         0.         1.        ]]


<p style="font-family:Roboto; font-size: 26px; color: magenta"> Example 2: Here, tf-idf values are computed from a corpus in which every document is a single, unique term. </p>

In [14]:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents
d0 = 'geek1'
d1 = 'geek2'
d2 = 'geek3'
d3 = 'geek4'

# merge documents into a single corpus
string = [d0, d1, d2, d3]

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf values:')
print(result)


Word indexes:
{'geek1': 0, 'geek2': 1, 'geek3': 2, 'geek4': 3}

tf-idf values:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4 stored elements and shape (4, 4)>
  Coords	Values
  (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0


<p style="font-family:Roboto; font-size: 26px; color: magenta"> Example 3: In this program, tf-idf values are computed from a corpus of identical documents.</p>

In [15]:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents
d0 = 'Geeks for geeks!'
d1 = 'Geeks for geeks!'


# merge documents into a single corpus
string = [d0, d1]

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf values:')
print(result)


Word indexes:
{'geeks': 1, 'for': 0}

tf-idf values:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4 stored elements and shape (2, 2)>
  Coords	Values
  (0, 1)	0.8944271909999159
  (0, 0)	0.4472135954999579
  (1, 1)	0.8944271909999159
  (1, 0)	0.4472135954999579
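
Because both documents are identical, every term occurs in every document and the smoothed idf collapses to ln(3/3) + 1 = 1. The scores then depend only on the term counts (1 for "for", 2 for "geeks") and the L2 normalization. A short sketch of that reasoning, with illustrative variable names:

```python
import math

n_docs = 2
df = 2  # both terms occur in both documents

# Smoothed idf is exactly 1 when a term is in every document
idf = math.log((1 + n_docs) / (1 + df)) + 1

counts = {"for": 1, "geeks": 2}
length = math.sqrt(sum((c * idf) ** 2 for c in counts.values()))  # sqrt(5)
scores = {term: c * idf / length for term, c in counts.items()}

print(idf)
print(scores)
```

The two values are 1/√5 and 2/√5, which matches the output above.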


<p style="font-family:Roboto; font-size: 26px; color: magenta"> Example 4: Below is a program that calculates the tf-idf value of a single word, geeks, repeated multiple times in multiple documents.</p>

In [16]:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

# assign corpus
string = ['Geeks geeks']*5

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf values:')
print(result)


Word indexes:
{'geeks': 0}

tf-idf values:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 5 stored elements and shape (5, 1)>
  Coords	Values
  (0, 0)	1.0
  (1, 0)	1.0
  (2, 0)	1.0
  (3, 0)	1.0
  (4, 0)	1.0
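
Every score is 1.0 even though "geeks" occurs twice per document, because a row with a single term always has unit length after L2 normalization. The sketch below is a plain-Python approximation of the default `TfidfVectorizer` pipeline (lowercasing, tokens of two or more word characters, smoothed idf, optional L2 normalization), written to make that visible. The function `tfidf_rows` and its `normalize` flag are hypothetical and not part of scikit-learn.

```python
import math
import re
from collections import Counter

def tfidf_rows(corpus, normalize=True):
    """Return one {term: score} dict per document (approximate sketch)."""
    # Lowercase and keep tokens of 2+ word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        row = {t: count * idf[t] for t, count in Counter(tokens).items()}
        if normalize and row:
            length = math.sqrt(sum(v * v for v in row.values()))
            row = {t: v / length for t, v in row.items()}
        rows.append(row)
    return rows

corpus = ["Geeks geeks"] * 5
raw_rows = tfidf_rows(corpus, normalize=False)  # count * idf, no scaling
unit_rows = tfidf_rows(corpus)                  # scaled to unit length

print(raw_rows[0])
print(unit_rows[0])
```

Without normalization the score is count × idf = 2 × 1 = 2.0. In scikit-learn itself, the unnormalized values can be obtained by passing `norm=None` to `TfidfVectorizer`.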
