## TF-IDF

In the approaches we have seen so far, every word in the text is treated as equally important; there is no notion of some words in a document mattering more than others. TF-IDF (term frequency-inverse document frequency) addresses this issue. It aims to quantify the importance of a given word relative to the other words in the document and in the corpus. It is a commonly used representation scheme in information retrieval systems, for extracting the documents in a corpus that are relevant to a given text query.
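
For reference, one common textbook formulation of the score is sketched below. The symbols $t$ (a term), $d$ (a document), $N$ (the number of documents in the corpus) and $\mathrm{df}(t)$ (the number of documents that contain $t$) are notation introduced here for illustration. Libraries differ in the exact variant they implement, for example in how they smooth the IDF or normalize the final vector.

$$
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of terms in } d}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Under this formulation, a term that occurs in every document has $\mathrm{idf}(t) = \log 1 = 0$, so it contributes nothing to the representation, while a term confined to a few documents is weighted up.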

This notebook shows a simple example of how to get the TF-IDF representation of a document using sklearn's TfidfVectorizer. 

In [3]:
# To install only the requirements of this notebook, uncomment the line below and run this cell.
# Note: the cells below call get_feature_names_out(), which requires scikit-learn 1.0 or newer.

# ===========================

# !pip install "scikit-learn>=1.0"

# ===========================


In [2]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try :
#     import google.colab
#     !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch3/ch3-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError :
#     !pip install -r "ch3-requirements.txt"

# ===========================

In [7]:
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food and barks."]
processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food and barks']

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
bow_rep_tfidf = tfidf.fit_transform(processed_docs)

#IDF for all words in the vocabulary
print("IDF for all words in the vocabulary",tfidf.idf_)
print("-"*10)
#All words in the vocabulary.
print("All words in the vocabulary",tfidf.get_feature_names_out())
#print("-"*10)

#TFIDF representation for all documents in our corpus 
print("TFIDF representation for all documents in our corpus\n",bow_rep_tfidf.toarray()) 
print("-"*10)

temp = tfidf.transform(["dog and man are friends"])
print("Tfidf representation for 'dog and man are friends':\n", temp.toarray())

IDF for all words in the vocabulary [1.91629073 1.91629073 1.51082562 1.22314355 1.51082562 1.91629073
 1.22314355 1.91629073]
----------
All words in the vocabulary ['and' 'barks' 'bites' 'dog' 'eats' 'food' 'man' 'meat']
TFIDF representation for all documents in our corpus
 [[0.         0.         0.65782931 0.53256952 0.         0.
  0.53256952 0.        ]
 [0.         0.         0.65782931 0.53256952 0.         0.
  0.53256952 0.        ]
 [0.         0.         0.         0.44809973 0.55349232 0.
  0.         0.70203482]
 [0.49819711 0.49819711 0.         0.         0.39278432 0.49819711
  0.31799276 0.        ]]
----------
Tfidf representation for 'dog and man are friends':
 [[0.74230628 0.         0.         0.47380449 0.         0.
  0.47380449 0.        ]]
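
To make the numbers above less opaque, here is a minimal sketch that recomputes them by hand. It assumes the default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), for which scikit-learn documents the IDF as `ln((1 + n) / (1 + df)) + 1` and scales each document vector to unit Euclidean length. The variable names (`counts`, `doc_freq`, `manual_idf`, `manual_tfidf`, `reference`) are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food and barks"]

# Raw term counts: one row per document, one column per vocabulary word.
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Smoothed IDF (assumes the default smooth_idf=True).
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 length.
weighted = counts * manual_idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's own result.
reference = TfidfVectorizer().fit(docs)
print(np.allclose(manual_idf, reference.idf_))
print(np.allclose(manual_tfidf, reference.transform(docs).toarray()))
```

The same arithmetic explains the last output above. In 'dog and man are friends', the words 'are' and 'friends' are not in the fitted vocabulary and are ignored, and 'and' receives the largest weight because it occurs in only one training document (IDF of about 1.92, versus about 1.22 for 'dog' and 'man', which each occur in three).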


We will see how this representation can be used for text classification later, in Chapter 4!

## Only TF

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer


tf = TfidfVectorizer(use_idf=False, lowercase=True)
bow_rep_tf = tf.fit_transform(processed_docs)

# idf_ is only defined when use_idf=True, so there are no IDF values to print here.
# All words in the vocabulary.
print("All words in the vocabulary",tf.get_feature_names_out())

# TF representation (L2-normalized term counts) for all documents in our corpus
print("TF representation for all documents in our corpus\n",bow_rep_tf.toarray())
print("-"*10)

temp = tf.transform(["dog and man are friends"])
print("TF representation for 'dog and man are friends':\n", temp.toarray())

All words in the vocabulary ['and' 'barks' 'bites' 'dog' 'eats' 'food' 'man' 'meat']
TF representation for all documents in our corpus
 [[0.         0.         0.57735027 0.57735027 0.         0.
  0.57735027 0.        ]
 [0.         0.         0.57735027 0.57735027 0.         0.
  0.57735027 0.        ]
 [0.         0.         0.         0.57735027 0.57735027 0.
  0.         0.57735027]
 [0.4472136  0.4472136  0.         0.         0.4472136  0.4472136
  0.4472136  0.        ]]
----------
TF representation for 'dog and man are friends':
 [[0.57735027 0.         0.         0.57735027 0.         0.
  0.57735027 0.        ]]
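
With `use_idf=False`, every word in a document gets the same weight as long as it occurs the same number of times, which is why each row above contains a single repeated value. The sketch below, which assumes the default `norm='l2'`, checks that the result is simply the raw count matrix with each row scaled to unit length. The names `counts`, `manual_tf`, `tf_only` and `query` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food and barks"]

# Raw term counts, then each row divided by its Euclidean (L2) norm.
counts = CountVectorizer().fit_transform(docs).toarray()
manual_tf = counts / np.linalg.norm(counts, axis=1, keepdims=True)

# TF-only vectorizer from scikit-learn for comparison.
tf_only = TfidfVectorizer(use_idf=False).fit(docs)
print(np.allclose(manual_tf, tf_only.transform(docs).toarray()))

# Words never seen during fit ("are", "friends") have no column and are dropped,
# so only "and", "dog" and "man" contribute to the query vector.
query = tf_only.transform(["dog and man are friends"]).toarray()
print(np.count_nonzero(query))
```

A document with three distinct words, each occurring once, therefore gets the value 1/sqrt(3), about 0.577, in each of its three non-zero columns, and a document with five such words gets 1/sqrt(5), about 0.447.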
