<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **(Supplemental) Term Frequency - Inverse Document Frequency**


Estimated time needed: **15** minutes


As we learned for non-negative matrix factorization (NMF), one application of this unsupervised dimensionality reduction technique is to apply it to a tf-idf matrix.

## Why tf-idf?

Intuitively, for a given term in a document, we multiply the count of that term in the document by a measure of how rare that term is across all the documents we are looking at.

Imagine any corpus of text. You'll probably see many words that appear in almost every document, such as `the`, `and`, and `so`. If you want to quickly analyze the text to find the most important words in each document, looking at raw word counts isn't good enough: these common words would dominate the counts and clutter the analysis.

By performing tf-idf, we can reduce the value assigned to these words that are really common in all our documents, and increase the value of words that may appear a lot in a certain document, but not frequently in other documents.
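
To make the problem concrete, here is a minimal sketch using a made-up three-document corpus (the sentences are illustrative assumptions, not part of the lab's dataset). Counting words across the corpus puts the uninformative ones at the top:

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration only
docs = [
    "the cat sat on the mat and the dog slept",
    "the stock market fell and the bond market rose",
    "the rocket left the pad and the crew cheered",
]

# Count every word across the whole corpus
counts = Counter(word for doc in docs for word in doc.split())

# The most frequent words say little about any single document
print(counts.most_common(3))
```

Words such as `rocket` or `stock`, which actually distinguish the documents, sit far below `the` and `and` in this ranking.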

### Applications

With a tf-idf matrix, you can succinctly capture important textual information from a corpus. **A corpus is defined as a large structured set of text**. The tf-idf matrix gives you an efficient representation of which words are important to each document, and potentially of how those words relate documents to one another.

**We will be using tf-idf matrices in the next lab!**

## What is tf-idf?

A tf-idf matrix is a `term frequency - inverse document frequency` matrix. Every row within this matrix will represent a document, and every column represents a term (a term could be a single word or an n-tuple of words such as *United States of America*). A tf-idf matrix is actually an augmented `term frequency`, `bag of words` or `document-term` matrix.

### What is `term frequency`?

A `term frequency` matrix simply counts the number of occurrences of a given word within a document.


## Objectives

After completing this lab you will be able to:

*   Understand what term frequency and tf-idf matrices are
*   Explain the intuition behind both matrices and how they are calculated
*   Apply tf-idf to a corpus of text and find the most important word in each document


***


## Setup


For this lab, we will be using the following libraries:

*   [`pandas`](https://pandas.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for managing the data.
*   [`numpy`](https://numpy.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for mathematical operations.
*   [`sklearn`](https://scikit-learn.org/stable/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for machine learning and machine-learning-pipeline related functions.


### Installing Required Libraries

The following required libraries are pre-installed in the Skills Network Labs environment. However, if you run these notebook commands in a different Jupyter environment (e.g. Watson Studio or Anaconda), you will need to install these libraries by removing the `#` sign before `!mamba` in the code cell below.


In [1]:
# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.
# Note: the `get_feature_names_out` calls used in this lab need scikit-learn 1.0 or newer.
# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==1.1.1
# Note: If your environment doesn't support "!mamba install", use "!pip install"

The following required libraries are **not** pre-installed in the Skills Network Labs environment. **You will need to run the following cell** to install them:


In [41]:
import piplite
import micropip
await micropip.install(['skillsnetwork'])

### Importing Required Libraries

*We recommend you import all required libraries in one place (here):*


In [38]:
import re
import skillsnetwork
import pandas as pd
import numpy as np
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')


## Background


**Example**

Let's say we have two documents with one sentence each.

*   *"We like dogs and cats"*
*   *"We like cars and planes"*

If we vectorized these two documents into a `term frequency` matrix, we would get:

| doc | We | like | and | dogs | cats | cars | planes |
| --- | -- | ---- | --- | ---- | ---- | ---- | ------ |
| 0   | 1  | 1    | 1   | 1    | 1    | 0    | 0      |
| 1   | 1  | 1    | 1   | 0    | 0    | 1    | 1      |

Each entry simply counts how many times a word occurs in a document. (sklearn lowercases the text and sorts the columns alphabetically, so its column order differs from the table above.)
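
As a rough pure-Python sketch of this counting step (not sklearn's actual implementation), the snippet below mimics two `CountVectorizer` defaults: lowercasing and keeping only tokens of two or more word characters. The names `tokenize` and `tf_rows` are hypothetical helpers introduced for illustration.

```python
import re

docs = ["We like dogs and cats", "We like cars and planes"]

def tokenize(doc):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r"(?u)\b\w\w+\b", doc.lower())

tokens = [tokenize(doc) for doc in docs]

# Alphabetically sorted vocabulary, one column per term
vocab = sorted({tok for doc_tokens in tokens for tok in doc_tokens})

# One row per document, counting occurrences of each vocabulary term
tf_rows = [[doc_tokens.count(term) for term in vocab] for doc_tokens in tokens]

print(vocab)
for row in tf_rows:
    print(row)
```

The counts are the same as in the table, only with the columns in alphabetical order.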

Let's convert this into a tf-idf matrix. Each element is computed with the following formulas, where $\log$ is the natural logarithm:

$\text{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} + 1$

$\text{tfidf}(t, d, D) = f_{t,d} \times \text{idf}(t, D)$

Where:

*   $f_{t,d}$ is the raw count of the term $t$ in document $d$
*   $N$ is the total number of documents in the corpus $D$
*   $|\{d \in D : t \in d\}|$ is the number of documents in which the term $t$ appears
*   We add 1 to the idf so that a term appearing in every document (for which $\log \frac{N}{N} = 0$) still gets a non-zero weight instead of being ignored entirely
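
The formula can be applied directly to the term counts from the example. The following is a minimal pure-Python sketch, assuming the natural logarithm; the variable names are illustrative and not part of the lab.

```python
import math

# Term counts for the two example documents (columns follow `terms`)
terms = ["we", "like", "and", "dogs", "cats", "cars", "planes"]
tf = [
    [1, 1, 1, 1, 1, 0, 0],  # "We like dogs and cats"
    [1, 1, 1, 0, 0, 1, 1],  # "We like cars and planes"
]

n_docs = len(tf)

# Document frequency: number of documents containing each term
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(terms))]

# idf = ln(N / df) + 1
idf = [math.log(n_docs / d) + 1 for d in df]

# tf-idf = raw count * idf, element by element
tfidf = [[count * w for count, w in zip(row, idf)] for row in tf]

for row in tfidf:
    print([round(v, 4) for v in row])
```

Terms shared by both documents keep a weight of 1, while terms unique to one document are weighted up to about 1.6931.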


### Converting to a tf-idf matrix:

For the first document (doc 0), the tf-idf value for `like` would be $1 \times (\log(\frac{2}{2}) + 1) = 1$, but the tf-idf value for `dogs` would be $1 \times (\log(\frac{2}{1}) + 1) = 1.693147$

If we computed this for every element, we would have:

| doc | We | like | and | dogs   | cats   | cars   | planes |
| --- | -- | ---- | --- | ------ | ------ | ------ | ------ |
| 0   | 1  | 1    | 1   | 1.6931 | 1.6931 | 0      | 0      |
| 1   | 1  | 1    | 1   | 0      | 0      | 1.6931 | 1.6931 |


### Doing it in code

This is the sklearn class that converts a list of document strings into a term frequency matrix.

```python
CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
```

This is the sklearn class that converts a term frequency matrix into a tf-idf matrix.

```python
TfidfTransformer(*, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
```
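
The `smooth_idf` parameter changes which idf formula is used. With `smooth_idf=False`, sklearn uses $\ln(N / df) + 1$, the formula given earlier; with the default `smooth_idf=True`, it uses $\ln((1 + N) / (1 + df)) + 1$, as if one extra document containing every term had been added. The sketch below compares the two for a two-document corpus; `idf_plain` and `idf_smooth` are hypothetical helper names, not sklearn functions.

```python
import math

def idf_plain(n_docs, doc_freq):
    # Corresponds to smooth_idf=False: ln(N / df) + 1
    return math.log(n_docs / doc_freq) + 1

def idf_smooth(n_docs, doc_freq):
    # Corresponds to smooth_idf=True: ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 2
for doc_freq in (1, 2):
    plain = round(idf_plain(n_docs, doc_freq), 4)
    smooth = round(idf_smooth(n_docs, doc_freq), 4)
    print(doc_freq, plain, smooth)
```

Smoothing lowers the weight of rare terms slightly and avoids a division by zero for a term that appears in no document.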

Let's implement the example above using these classes!


In [36]:
# Corpus
D = ["We like dogs and cats", "We like cars and planes"]

In [37]:
# Count Vectorizer creates a term frequency matrix
cv = CountVectorizer()
tf_mat = cv.fit_transform(D)
tf = pd.DataFrame(tf_mat.toarray(), columns = cv.get_feature_names_out())
tf


In [None]:
# Creating the tfidf matrix
# smooth_idf=False makes sklearn use the idf formula given above
tfidf_trans = TfidfTransformer(smooth_idf=False)
tfidf_mat = tfidf_trans.fit_transform(tf)
# Reuse the column names of the term frequency DataFrame
tfidf = pd.DataFrame(tfidf_mat.toarray(), columns = tf.columns)

By default, sklearn's `TfidfTransformer` L2-normalizes its output (`norm='l2'`), so that the norm (length) of each document vector (row) is 1. We can instead take the idf vector learned from our data and apply it directly to the term frequency matrix to get the non-normalized tf-idf matrix.


In [None]:
# Non-normalized tf-idf (idf weights multiplied into the raw counts)
pd.DataFrame(tfidf_trans.idf_ * tf.to_numpy(), columns = tf.columns)

In [None]:
# Normalized tf-idf
tfidf

*Note: These values are different from the ones we manually calculated as sklearn normalizes each document vector.*

*I.e. $\overrightarrow{d} \cdot \overrightarrow{d} = 1$*


In [None]:
# d
print(tfidf.iloc[0,:])
# d * d
np.multiply(tfidf.iloc[0,:], tfidf.iloc[0,:]).sum().round()
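
The normalization step can also be reproduced by hand. The following is a minimal sketch, assuming the non-normalized tf-idf row for *"We like dogs and cats"* from the worked example; it divides each entry by the row's Euclidean length.

```python
import math

# Non-normalized tf-idf row for "We like dogs and cats"
# (terms: we, like, and, dogs, cats, cars, planes)
idf_rare = math.log(2 / 1) + 1
row = [1.0, 1.0, 1.0, idf_rare, idf_rare, 0.0, 0.0]

# Euclidean (L2) length of the row
length = math.sqrt(sum(v * v for v in row))

# Divide every entry by the length so the row has unit norm
normalized = [v / length for v in row]

print([round(v, 4) for v in normalized])
print(round(sum(v * v for v in normalized), 6))  # d . d
```

The relative ordering of the terms is unchanged: `dogs` and `cats` still carry more weight than the shared words, only the overall scale differs.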

# Exercises

Let's try creating a tf-idf matrix ourselves! Below we load a [dataset from Kaggle](https://www.kaggle.com/datasets/vivmankar/physics-vs-chemistry-vs-biology?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2022-01-01) of text, made up of news documents. This is an open domain dataset that is free to use.


Let's start by loading the data into a `pandas.DataFrame`:

Since you're using `jupyterlite`, you will need to use the following method to load datasets:


In [None]:
URL = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/tfidf.csv'
await skillsnetwork.download_dataset(URL)
# df = pd.read_json('tfidf.json')
df = pd.read_csv('tfidf.csv').iloc[:,1]

Let's look at some sample rows from the dataset we loaded:


In [None]:
df.head(5)

## Exercise 1 - Count-vectorizing our text

Convert this Series of documents into a term frequency matrix. Note that this dataset contains numbers, and we want to remove them for simplicity's sake.

Use the following function and pass it to `CountVectorizer` as the `preprocessor` argument: `CountVectorizer(preprocessor=preprocess_text)`.

We also want to limit the `CountVectorizer` to just the top 500 words using the `max_features` argument.

**Apply the `CountVectorizer` to the `df` Series and name the columns using the features returned by `cv.get_feature_names_out()`**


In [None]:
# Let's remove the numbers
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    return text
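
To see what this preprocessor does before plugging it in, here is a small usage sketch. The function is repeated so the snippet runs on its own, and the sample sentence is a made-up input, not taken from the dataset.

```python
import re

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    return text

# Hypothetical input sentence, invented for illustration
sample = "In 1905 Einstein published 4 papers"
cleaned = preprocess_text(sample)
print(repr(cleaned))
```

The digits are removed and the leftover double spaces are harmless, since the vectorizer's tokenizer splits on non-word characters anyway. Note that digits inside a token are removed too, so a token like `h2o` would become `ho`. The function lowercases the text itself because a custom `preprocessor` replaces `CountVectorizer`'s built-in lowercasing step.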

In [None]:
# Your solution here
cv = CountVectorizer(max_features = 500, preprocessor = preprocess_text)
tf = cv.fit_transform(df)
pd.DataFrame(tf.toarray(), columns = cv.get_feature_names_out())

<details>
    <summary>Click here for Solution</summary>

```python
cv = CountVectorizer(max_features = 500, preprocessor = preprocess_text)
tf = cv.fit_transform(df)
pd.DataFrame(tf.toarray(), columns = cv.get_feature_names_out())
```

</details>


## Exercise 2 - Applying the tf-idf transformer

Now that we have a term frequency matrix, we can apply the tf-idf transformation to it to obtain a matrix whose values represent how important each word is to its document.

**Apply the `TfidfTransformer` to the `tf` matrix and name the columns using the features returned by `cv.get_feature_names_out()`**


In [None]:
# Your solution here
tfidf_trans = TfidfTransformer()
tfidf_mat = tfidf_trans.fit_transform(tf.toarray())
tfidf = pd.DataFrame(tfidf_mat.toarray(), columns = cv.get_feature_names_out())
tfidf

<details>
    <summary>Click here for Solution</summary>

```python
tfidf_trans = TfidfTransformer()
tfidf_mat = tfidf_trans.fit_transform(tf.toarray())
# Keep the result in `tfidf`, since the cells below use it
tfidf = pd.DataFrame(tfidf_mat.toarray(), columns = cv.get_feature_names_out())
tfidf
```

</details>
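
One of the objectives of this lab is to find the most important word in each document. Given tf-idf values, that is the term with the largest value in each row. The sketch below shows the idea on hypothetical values invented for illustration, with one dictionary per document.

```python
# Hypothetical tf-idf values, invented for illustration (one dict per document)
tfidf_rows = [
    {"apple": 0.5, "orange": 0.3, "pear": 0.0},
    {"apple": 0.0, "orange": 0.0, "pear": 0.4},
]

# For each document, pick the term with the largest tf-idf value
top_terms = [max(row, key=row.get) for row in tfidf_rows]
print(top_terms)
```

With the `tfidf` DataFrame from Exercise 2, `tfidf.idxmax(axis=1)` should give the same kind of result: the column label of the largest value in each row.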


### Dense format matrices

As we can see above, both the term frequency and tf-idf matrices contain a lot of 0's. When dealing with a very large corpus, or a corpus with a large number of unique words/features, we will often store only the non-zero entries, one row per (document, word) pair. This lab calls this the dense format; it is the same idea that underlies sparse matrix storage. Because the zeros are never stored, it saves space in RAM.


**Normal format:**

| doc | apple | orange | pear |
| --- | ----- | ------ | ---- |
| 0   | 0.5   | 0.3    | 0    |
| 1   | 0     | 0      | 0.4  |

**Dense format:**

| doc | word   | TFIDF Value |
| --- | ------ | ----------- |
| 0   | apple  | 0.5         |
| 0   | orange | 0.3         |
| 1   | pear   | 0.4         |


In code:


In [None]:
tfidf

In [None]:
dense_tfidf = tfidf.stack()
dense_tfidf[dense_tfidf != 0]
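
The same conversion can be sketched without pandas, using the apple/orange/pear table above. This is only an illustration of the idea; `triples` is a hypothetical name.

```python
terms = ["apple", "orange", "pear"]
matrix = [
    [0.5, 0.3, 0.0],  # doc 0
    [0.0, 0.0, 0.4],  # doc 1
]

# Keep one (doc, word, value) triple per non-zero entry
triples = [
    (doc, terms[col], value)
    for doc, row in enumerate(matrix)
    for col, value in enumerate(row)
    if value != 0
]

for triple in triples:
    print(triple)
```

Six stored values shrink to three triples here; on a real corpus, where most entries are zero, the saving is much larger.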

#### Congratulations!

You've successfully completed the optional tf-idf lab, in which you learned how tf-idf matrices are created. In the next lab, you'll be using these as a starting point. Enjoy!


## Authors


[Richard Ye](https://linkedin.com/in/richard-ye?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2022-01-01) is an undergraduate student at the University of Toronto studying Statistics and Finance.



## Change Log


| Date (YYYY-MM-DD) | Version | Changed By | Change Description |
| ----------------- | ------- | ---------- | ------------------ |
| 2022-06-03        | 0.1     | Richard Ye | Create Lab         |
| 2022-06-21        | 0.2     | Steve Hord | QA Pass            |


Copyright © 2022 IBM Corporation. All rights reserved.
