## TF-IDF Exercise

This notebook demonstrates TF-IDF encoding on three sentences about machine learning.

TF-IDF (Term Frequency-Inverse Document Frequency) quantifies how important a word is to a document within a corpus: the score rises with the word's frequency in that document and falls as more documents in the corpus contain the word.
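
As a rough sketch of the inverse-document-frequency part: with its default settings (`smooth_idf=True`), scikit-learn's `TfidfVectorizer` computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`. The helper name `smoothed_idf` is illustrative and not part of any library.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF, assuming scikit-learn's default smooth_idf=True formula."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# For a three-document corpus, a term can occur in 1, 2, or 3 documents.
idf_by_doc_freq = {df: smoothed_idf(3, df) for df in (1, 2, 3)}
for df, value in idf_by_doc_freq.items():
    print(f"df={df}: idf={value:.8f}")
# df=1: idf=1.69314718
# df=2: idf=1.28768207
# df=3: idf=1.00000000
```

These are the only three IDF values a three-document corpus can produce under this formula.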

In [22]:
# Define the documents
documents = [
    "Often, machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model.",
    "Getting started in applied machine learning can be difficult, especially when working with real-world data.",
    "One good example is to use a one-hot encoding on categorical data."
]

# Preprocess the documents (lowercase, strip periods and commas, replace hyphens with spaces)
processed_docs = [doc.lower().replace(".", "").replace(",", "").replace("-", " ") for doc in documents]
processed_docs

['often machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model',
 'getting started in applied machine learning can be difficult especially when working with real world data',
 'one good example is to use a one hot encoding on categorical data']
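
The chained `replace` calls above handle only periods, commas, and hyphens. A more general variant, sketched here under the assumption that every non-word character should become a space, uses a regular expression. The function name `normalize` is hypothetical.

```python
import re


def normalize(text):
    """Lowercase, turn punctuation into spaces, and collapse repeated whitespace."""
    lowered = text.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)  # any non-word, non-space character
    return re.sub(r"\s+", " ", no_punct).strip()


sample = "One good example is to use a one-hot encoding on categorical data."
cleaned = normalize(sample)
print(cleaned)
# one good example is to use a one hot encoding on categorical data
```

For the three sentences in this notebook, this yields the same strings as the chained-`replace` version, and it also covers punctuation such as `?` or `;` that the original would leave in place.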

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
bow_rep_tfidf = tfidf.fit_transform(processed_docs)

# IDF for all words in the vocabulary
print("IDF for all words in the vocabulary", tfidf.idf_)
print("-" * 10)

# All words in the vocabulary
print("All words in the vocabulary", tfidf.get_feature_names_out())
print("-" * 10)

# TFIDF representation for all documents in the corpus
print("TFIDF representation for all documents in our corpus\n", bow_rep_tfidf.toarray())
print("-" * 10)

# Test with a new sentence
temp = tfidf.transform(["machine learning data"])
print("TFIDF representation for 'machine learning data':\n", temp.toarray())

IDF for all words in the vocabulary [1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.
 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718
 1.69314718 1.69314718 1.28768207 1.69314718 1.28768207 1.28768207
 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718
 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718
 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718
 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718]
----------
All words in the vocabulary ['applied' 'be' 'before' 'can' 'categorical' 'data' 'difficult' 'encoding'
 'especially' 'example' 'fitting' 'getting' 'good' 'hot' 'in' 'is'
 'learning' 'machine' 'model' 'often' 'on' 'one' 'or' 'prepare' 'real'
 'recommend' 'require' 'specific' 'started' 'that' 'to' 'tutorials' 'use'
 'ways' 'when' 'will' 'with' 'working' 'world' 'you' 'your']
----------
TFIDF representation for all documents in our corpus
 [[0.         0.         0.22057047 0.         0.         0.13

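The TF-IDF representation scores the importance of each word in each document relative to the entire corpus. Words that appear frequently in one document but in few other documents receive higher scores. In the IDF output above, 'data' occurs in all three documents and has the lowest IDF (1.0); 'in', 'learning', and 'machine' occur in two documents (about 1.288); every other word occurs in only one document (about 1.693).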
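
To make the scoring concrete, the sketch below recomputes the first document's row by hand. It assumes scikit-learn's defaults: raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization of each row, and a tokenizer that keeps only tokens of two or more word characters (which is why 'a' is missing from the vocabulary). The names `tokenized`, `idf`, and `tfidf_row` are illustrative.

```python
import math
import re
from collections import Counter

docs = [
    "often machine learning tutorials will recommend or require that you prepare "
    "your data in specific ways before fitting a machine learning model",
    "getting started in applied machine learning can be difficult especially "
    "when working with real world data",
    "one good example is to use a one hot encoding on categorical data",
]

# Keep tokens of 2+ word characters, mirroring the vectorizer's default pattern.
token_re = re.compile(r"\b\w\w+\b")
tokenized = [token_re.findall(doc) for doc in docs]

# Document frequency: in how many documents each term occurs.
n_docs = len(tokenized)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_row(tokens):
    """Return L2-normalized tf * idf weights for one tokenized document."""
    counts = Counter(tokens)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


row0 = tfidf_row(tokenized[0])
print(f"before:  {row0['before']:.8f}")
print(f"machine: {row0['machine']:.8f}")
print(f"data:    {row0['data']:.8f}")
```

Under these assumptions the weight for 'before' should agree with the 0.22057047 printed in the third column of the first row. 'machine' scores higher than 'before' despite its lower IDF because it occurs twice in the document, and 'data' scores lowest because it appears in every document.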