In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample set of documents
docs = [
    "the sky is blue",
    "sky is blue and sky is beautiful",
    "the beautiful sky is so blue",
    "I love blue cheese"
]

In [2]:
# Create the vectorizer and compute the TF-IDF matrix
tfidfvectorizer = TfidfVectorizer()

# Fit the model and transform the documents
tfidf_wm = tfidfvectorizer.fit_transform(docs)

In [3]:
print(tfidf_wm)

  (0, 2)	0.3992102058196136
  (0, 4)	0.4882913888670788
  (0, 6)	0.4882913888670788
  (0, 8)	0.6031370082211672
  (1, 1)	0.3473079263825201
  (1, 0)	0.44051606615876376
  (1, 2)	0.22987955785181605
  (1, 4)	0.5623513975308212
  (1, 6)	0.5623513975308212
  (2, 7)	0.5479699188774512
  (2, 1)	0.4320257780944028
  (2, 2)	0.2859534358554926
  (2, 4)	0.34976210104278727
  (2, 6)	0.34976210104278727
  (2, 8)	0.4320257780944028
  (3, 3)	0.6633846138519129
  (3, 5)	0.6633846138519129
  (3, 2)	0.34618161159873423


In [4]:
# Retrieve the terms found in the corpus
tfidf_tokens = tfidfvectorizer.get_feature_names_out()
tfidf_tokens


array(['and', 'beautiful', 'blue', 'cheese', 'is', 'love', 'sky', 'so',
       'the'], dtype=object)

In [5]:
# Create a DataFrame for easy viewing
import pandas as pd
df_tfidfvect = pd.DataFrame(
    data=tfidf_wm.toarray(),
    index=['Doc1', 'Doc2', 'Doc3', 'Doc4'],
    columns=tfidf_tokens,
)

# View the TF-IDF DataFrame
print(df_tfidfvect)


           and  beautiful      blue    cheese        is      love       sky  \
Doc1  0.000000   0.000000  0.399210  0.000000  0.488291  0.000000  0.488291   
Doc2  0.440516   0.347308  0.229880  0.000000  0.562351  0.000000  0.562351   
Doc3  0.000000   0.432026  0.285953  0.000000  0.349762  0.000000  0.349762   
Doc4  0.000000   0.000000  0.346182  0.663385  0.000000  0.663385  0.000000   

           so       the  
Doc1  0.00000  0.603137  
Doc2  0.00000  0.000000  
Doc3  0.54797  0.432026  
Doc4  0.00000  0.000000  


This code snippet performs the following steps:

1. It initializes a list of strings where each string represents a document.
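2. It creates a TfidfVectorizer object, which converts text documents into a matrix of TF-IDF features.
3. It calls fit_transform, which learns the vocabulary and IDF weights from the documents and then transforms them into a sparse TF-IDF matrix. Because the matrix is sparse, print(tfidf_wm) lists only the nonzero entries as (row, column) pairs.
4. It retrieves the vocabulary: the unique tokens (words) found anywhere in the corpus, in alphabetical order. With the default settings, text is lowercased and single-character tokens such as "I" are dropped, so the fourth document contributes only "love", "blue", and "cheese".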
5. It creates a pandas DataFrame to display the TF-IDF scores in a tabular format, with documents as rows and words as columns.
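
To see where the vocabulary in step 4 comes from, the following sketch runs the vectorizer's own tokenizer on one document. It assumes the default TfidfVectorizer settings used above; the names vectorizer, analyzer, doc4_tokens, and vocab are illustrative and not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the sky is blue",
    "sky is blue and sky is beautiful",
    "the beautiful sky is so blue",
    "I love blue cheese",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# build_analyzer() returns the function the vectorizer uses to
# preprocess (lowercase) and tokenize a raw string.
analyzer = vectorizer.build_analyzer()
doc4_tokens = analyzer("I love blue cheese")
print(doc4_tokens)  # ['love', 'blue', 'cheese']

# vocabulary_ maps each term to its column index in the TF-IDF matrix.
vocab = vectorizer.vocabulary_
print(sorted(vocab, key=vocab.get))
```

The single-letter "I" disappears because the default token pattern only keeps tokens of two or more alphanumeric characters.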

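The resulting DataFrame df_tfidfvect contains the TF-IDF score of each word in each document. A high score marks a word that is frequent in that document but rare across the corpus, and a zero means the word does not occur in the document at all. Words that appear in many documents get lower scores because of the IDF component, which down-weights common words. For example, "blue" occurs in all four documents and has the lowest nonzero score in every row, while "love" and "cheese", which occur only in Doc4, score highest there. By default, TfidfVectorizer also scales each row to unit length (L2 norm), so the values are not raw products of term frequency and IDF.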
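
As a rough check on how these scores arise, the sketch below rebuilds the matrix by hand from raw counts. It assumes scikit-learn's default settings (smoothed IDF, no sublinear TF scaling, L2 row normalization); the variable names counts, doc_freq, idf, manual, and reference are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the sky is blue",
    "sky is blue and sky is beautiful",
    "the beautiful sky is so blue",
    "I love blue cheese",
]

# Raw term counts (TF), shape (n_docs, n_terms).
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF as used by scikit-learn's defaults:
# idf = ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply TF by IDF, then scale each row to unit L2 norm.
tfidf = counts * idf
manual = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Compare against the library result.
reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

For Doc1, "blue" has a document frequency of 4, so its IDF is ln(5/5) + 1 = 1, the smallest possible value. After normalization this gives the 0.399210 shown in the table.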