
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in natural language processing to evaluate the importance of a word in a document relative to a corpus. It is commonly used for feature extraction and text mining tasks.

TF (Term Frequency) measures the frequency of a term (word) in a document. It is calculated as the number of times a term appears in a document divided by the total number of terms in the document. The idea behind TF is to highlight terms that occur frequently within a document.
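As a minimal sketch of the TF definition above, the snippet below divides each term's count by the document length. The function name `term_frequency`, the whitespace tokenizer, and the sample sentence are assumptions made for illustration only.

```python
from collections import Counter


def term_frequency(document: str) -> dict[str, float]:
    """Return count / total-terms for every term in one document."""
    tokens = document.lower().split()  # naive whitespace tokenizer (an assumption)
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical sample sentence: "the" occurs 2 times out of 6 terms.
tf = term_frequency("the cat sat on the mat")
print(tf["the"])
```

With this definition, the TF values of a single document always sum to 1.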

IDF (Inverse Document Frequency) measures how informative a term is across a corpus. It is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents containing the term. IDF depends only on how many documents contain the term, not on how often the term occurs inside any one of them: terms that appear in few documents get high weights, and terms that appear in most documents get weights close to zero.
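The IDF definition above can be sketched in a few lines. The function name `inverse_document_frequency`, the whitespace tokenizer, and the four toy sentences are hypothetical; this is the plain textbook form, not necessarily what any particular library computes.

```python
import math


def inverse_document_frequency(term: str, documents: list[str]) -> float:
    """Return log(N / df) for a term, using the unsmoothed textbook form."""
    n_docs = len(documents)
    # Number of documents whose (naively tokenized) text contains the term.
    doc_freq = sum(1 for doc in documents if term in doc.lower().split())
    if doc_freq == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(n_docs / doc_freq)


# Hypothetical toy corpus, used only for this illustration.
toy_corpus = ["the cat sat", "the dog ran", "the cat ran away", "the bird flew"]

idf_the = inverse_document_frequency("the", toy_corpus)    # in 4 of 4 documents
idf_cat = inverse_document_frequency("cat", toy_corpus)    # in 2 of 4 documents
idf_bird = inverse_document_frequency("bird", toy_corpus)  # in 1 of 4 documents
print(idf_the, idf_cat, idf_bird)
```

A term that occurs in every document gets an IDF of exactly zero under this form, which is why the rarest term, `bird`, ends up with the largest weight.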

The TF-IDF score for a term in a document is the product of the term's TF in that document and its IDF across the corpus. The score is high for terms that occur often in the document but in few other documents, and low for terms that appear almost everywhere, so it gives a numerical representation of the importance of a term in a document relative to the entire corpus.
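The three definitions can be written compactly as follows. The symbols are notation introduced here for convenience: $f_{t,d}$ is the number of times term $t$ appears in document $d$, $N$ is the number of documents in the corpus, and $n_t$ is the number of documents that contain $t$.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

As a worked example with the natural logarithm, a term that appears 2 times in a 10-term document and in 1 of 4 documents gets $\mathrm{tf} = 2/10 = 0.2$, $\mathrm{idf} = \ln 4 \approx 1.386$, and a TF-IDF score of about $0.2 \times 1.386 \approx 0.277$.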

TF-IDF is often used for various NLP tasks such as information retrieval, document classification, and text summarization. It helps in identifying and prioritizing important words or phrases within documents, enabling more accurate analysis and understanding of textual data.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'Data Science is an overlap between Arts and Science',
'Generally, Arts graduates are right-brained and Science graduates are left-brained',
'Excelling in both Arts and Science at a time becomes difficult',
'Natural Language Processing is a part of Data Science'
]

1. `import pandas as pd`: This line is importing the `pandas` library as `pd`. The library is used for high-level data manipulation and analysis.

2. `from sklearn.feature_extraction.text import TfidfVectorizer`: This imports the `TfidfVectorizer` class from `sklearn.feature_extraction.text`. The class converts a collection of raw documents into a matrix of TF-IDF features.

3. `corpus = [...`: This defines a list of four sentences. A corpus is a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

In [2]:
# We will create a TF-IDF model

tfidf_model = TfidfVectorizer()
print(tfidf_model.fit_transform(corpus).todense())

[[0.40332811 0.25743911 0.         0.25743911 0.         0.
  0.40332811 0.         0.         0.31798852 0.         0.
  0.         0.         0.         0.31798852 0.         0.
  0.         0.         0.40332811 0.         0.         0.
  0.42094668 0.        ]
 [0.         0.159139   0.49864399 0.159139   0.         0.
  0.         0.         0.49864399 0.         0.         0.
  0.24932199 0.49864399 0.         0.         0.         0.24932199
  0.         0.         0.         0.         0.         0.24932199
  0.13010656 0.        ]
 [0.         0.22444946 0.         0.22444946 0.35164346 0.35164346
  0.         0.35164346 0.         0.         0.35164346 0.35164346
  0.         0.         0.35164346 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.18350214 0.35164346]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.30887228 0.         0.
  0.         0.         0.         0.30887228 0.39176

1. `TfidfVectorizer()` is a utility class provided by the `sklearn` (scikit-learn) library in Python, which is used to convert a collection of raw documents to a matrix of TF-IDF features. It is equivalent to `CountVectorizer` followed by `TfidfTransformer`.

2. `tfidf_model = TfidfVectorizer()`: an object of `TfidfVectorizer` is being created and is assigned to the variable `tfidf_model`.

3. `tfidf_model.fit_transform(corpus).todense()`: The `fit_transform()` function is applied on `corpus` which is a list of strings (sentences, paragraphs, etc.). The function will learn the vocabulary in the given text data, transform the data into a document-term matrix (DTM), calculate the tf-idf weights for each term in each document and return the tf-idf-weighted document-term matrix.

4. `.todense()` is called to convert the sparse matrix (the document-term matrix) returned by `fit_transform()` into a dense matrix for readability. A sparse matrix stores only the non-zero elements and omits the zeros to save space.
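The values in the printed matrix do not follow the plain count-over-length and log(N/df) formulas from the introduction. With its default settings, scikit-learn uses raw term counts, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below reproduces the first row of the output above in plain Python under those assumptions; the helper names `tokenize` and `tfidf_row` are made up for this illustration, and the regular expression is meant to mirror the library's default token pattern.

```python
import math
import re
from collections import Counter

corpus = [
    "Data Science is an overlap between Arts and Science",
    "Generally, Arts graduates are right-brained and Science graduates are left-brained",
    "Excelling in both Arts and Science at a time becomes difficult",
    "Natural Language Processing is a part of Data Science",
]


def tokenize(text: str) -> list[str]:
    # Lowercase, then keep runs of two or more word characters,
    # so the one-letter word "a" is dropped and "right-brained" is split.
    return re.findall(r"\b\w\w+\b", text.lower())


token_lists = [tokenize(doc) for doc in corpus]
vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
n_docs = len(corpus)

# Document frequency: in how many documents each term occurs at least once.
doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}


def tfidf_row(tokens: list[str]) -> dict[str, float]:
    counts = Counter(tokens)  # raw counts, not divided by document length
    weights = {t: counts[t] * idf[t] for t in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm of the row
    return {t: w / norm for t, w in weights.items()}


row0 = tfidf_row(token_lists[0])
for term in sorted(row0):
    print(f"{term:8s} {row0[term]:.6f}")
```

Because of the `+ 1` in the smoothed IDF, a word such as `science` that occurs in every document still receives a non-zero score, unlike in the textbook formula.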

In [3]:
# We need to create a DataFrame from the generated tf-idf matrix

tfidf_df = pd.DataFrame(tfidf_model.fit_transform(corpus).todense())
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()

,an,and,are,arts,at,becomes,between,both,brained,data,...,language,left,natural,of,overlap,part,processing,right,science,time
0,0.403328,0.257439,0.0,0.257439,0.0,0.0,0.403328,0.0,0.0,0.317989,...,0.0,0.0,0.0,0.0,0.403328,0.0,0.0,0.0,0.420947,0.0
1,0.0,0.159139,0.498644,0.159139,0.0,0.0,0.0,0.0,0.498644,0.0,...,0.0,0.249322,0.0,0.0,0.0,0.0,0.0,0.249322,0.130107,0.0
2,0.0,0.224449,0.0,0.224449,0.351643,0.351643,0.0,0.351643,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.183502,0.351643
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.308872,...,0.391765,0.0,0.391765,0.391765,0.0,0.391765,0.391765,0.0,0.204439,0.0


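1. `tfidf_df = pd.DataFrame(tfidf_model.fit_transform(corpus).todense())`: This fits the model on the `corpus` again and wraps the resulting dense TF-IDF matrix in a pandas DataFrame, with one row per document and one column per term.

2. `tfidf_df.columns = sorted(tfidf_model.vocabulary_)`: This assigns the column names of the DataFrame `tfidf_df`. `tfidf_model.vocabulary_` is a dictionary where the keys are the words in the `corpus` and the values are the column indices of those words in the TF-IDF matrix. Because scikit-learn assigns these indices in alphabetical order of the terms, `sorted(tfidf_model.vocabulary_)` returns the words in the same order as the matrix columns.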

3. `tfidf_df.head()`: This displays the first 5 rows of the DataFrame (here all 4 rows, since the corpus has only 4 documents). Each row represents a document in the `corpus`, and each column represents a word in the vocabulary learned from the `corpus`. The value in cell (i, j) is the TF-IDF score of word j in document i.

In [4]:
# We will create a DataFrame from the tf-idf matrix for the 10 most frequent terms

tfidf_model_small = TfidfVectorizer(max_features=10)
tfidf_df_small = pd.DataFrame(tfidf_model_small.fit_transform(corpus).todense())
tfidf_df_small.columns = sorted(tfidf_model_small.vocabulary_)
tfidf_df_small.head()

,an,and,are,arts,brained,data,graduates,is,right,science
0,0.491042,0.313426,0.0,0.313426,0.0,0.387143,0.0,0.387143,0.0,0.512492
1,0.0,0.170061,0.532867,0.170061,0.532867,0.0,0.532867,0.0,0.266433,0.139036
2,0.0,0.612172,0.0,0.612172,0.0,0.0,0.0,0.0,0.0,0.500491
3,0.0,0.0,0.0,0.0,0.0,0.640434,0.0,0.640434,0.0,0.423897


1. `TfidfVectorizer(max_features=10)`: This instantiates a `TfidfVectorizer` whose vocabulary is limited to 10 terms. The parameter `max_features=10` tells the vectorizer to keep only the top 10 tokens (words or, when n-grams are enabled, groups of words) ordered by term frequency across the corpus and to ignore all other tokens.

2. `tfidf_model_small.fit_transform(corpus).todense()`: This fits the TF-IDF model to the corpus and then transforms the corpus into a matrix where each row represents a document and each column a word. Each cell in the matrix holds the TF-IDF score of a specific word in a specific document. The `todense()` function converts the matrix from sparse format (usually used for large matrices that are mostly zero, to save space) to a conventional dense format.

3. `pd.DataFrame(tfidf_model_small.fit_transform(corpus).todense())`: This converts the dense matrix into a pandas DataFrame, which makes it easy to work with in subsequent steps such as building a machine learning model.

4. `tfidf_df_small.columns = sorted(tfidf_model_small.vocabulary_)`: This assigns the sorted words from the learned vocabulary as column names of the DataFrame. The vocabulary in this context is a dictionary where the keys are unique words and the values are column indices in the matrix.
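To see which terms `max_features=10` can be expected to keep, the sketch below counts how often each token occurs across the whole corpus in plain Python. The regular expression is an assumption meant to mirror the vectorizer's default tokenization, and `corpus_counts` and `repeated` are names introduced for this illustration.

```python
import re
from collections import Counter

corpus = [
    "Data Science is an overlap between Arts and Science",
    "Generally, Arts graduates are right-brained and Science graduates are left-brained",
    "Excelling in both Arts and Science at a time becomes difficult",
    "Natural Language Processing is a part of Data Science",
]

# Total occurrences of each token across the whole corpus.
corpus_counts = Counter(
    tok for doc in corpus for tok in re.findall(r"\b\w\w+\b", doc.lower())
)

# Terms that occur at least twice are guaranteed a place in the top 10 here.
repeated = sorted(term for term, count in corpus_counts.items() if count >= 2)
print(repeated)
print(len(corpus_counts) - len(repeated), "terms occur exactly once")
```

Only eight terms occur more than once, so the two remaining slots go to terms that occur a single time. In the output above those are `an` and `right`; which of the tied terms are chosen is an implementation detail of the library. The scores also differ from the full table because the rows are normalized over the 10 remaining terms only.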