# TF-IDF weighting

In this exercise, you will compute term weights (TF, IDF, and TF-IDF) from a document-term matrix.

In [None]:
from typing import List
import ipytest
import math
import pytest

ipytest.autoconfig()

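The document-term matrix contains the raw term frequencies. Each row corresponds to a document and each column to a term, so the entry in row $d$ and column $t$ is the number of times term $t$ occurs in document $d$.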

In [None]:
DOC_TERM_MATRIX = [
    [0, 0, 3, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 2, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 1, 1]
]

## Task 1: TF weighting

Compute the L1-normalized term frequency vector for a given document.

The L1-normalized frequency of a single term in a document is given by:

$$tf_{t,d}=\frac{c_{t,d}}{|d|}$$ 

where $c_{t,d}$ is the count of occurrences of term $t$ in document $d$ and $|d|$ is the document length (total number of terms).
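
As a worked illustration of the formula, the sketch below normalizes a small toy count vector (not the exercise data). The names `l1_normalize` and `toy_counts` are hypothetical, and returning an all-zero vector for an empty document is an assumption, since the formula is undefined when $|d| = 0$.

```python
from typing import List


def l1_normalize(counts: List[int]) -> List[float]:
    """Divides each raw count by the total count (hypothetical helper)."""
    total = sum(counts)  # |d|: total number of term occurrences
    if total == 0:
        # Assumption: an empty document maps to an all-zero vector.
        return [0.0] * len(counts)
    return [count / total for count in counts]


toy_counts = [2, 0, 1, 1]  # toy document with four term occurrences
toy_tf = l1_normalize(toy_counts)
print(toy_tf)  # [0.5, 0.0, 0.25, 0.25]
```

The resulting weights of a non-empty document always sum to 1, which is what L1 normalization means.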

In [None]:
def get_tf_vector(doc_term_vector: List[int]) -> List[float]:    
    """Computes the normalized term frequency vector from a raw term-frequency vector."""
    # TODO Complete this method.
    return [0] * len(doc_term_vector)

Tests.

In [None]:
%%run_pytest[clean]

def test_tf_doc0():
    # Compare floats with a tolerance rather than exact equality.
    assert get_tf_vector(DOC_TERM_MATRIX[0]) == pytest.approx([0, 0, 1, 0, 0, 0])

def test_tf_doc1():
    assert get_tf_vector(DOC_TERM_MATRIX[1]) == pytest.approx([0.25, 0.25, 0.5, 0, 0, 0])

## Task 2: IDF weighting

Compute the IDF weight of a term, given by:

$$idf_{t}=\log \frac{N}{n_t}$$ 

where $N$ is the total number of documents and $n_t$ is the number of documents that contain term $t$.
**Use base-10 logarithm in this exercise.**
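
As a worked illustration of the formula, the sketch below computes base-10 IDF values on a small toy matrix (not the exercise data). The names `TOY_MATRIX`, `document_frequency`, and `toy_idf` are hypothetical, and raising an error for a term that occurs in no document is an assumption, since the formula is undefined when $n_t = 0$.

```python
import math
from typing import List

# Toy matrix: 4 documents (rows) x 3 terms (columns).
TOY_MATRIX = [
    [2, 0, 1],
    [0, 0, 3],
    [1, 1, 0],
    [0, 0, 1],
]


def document_frequency(matrix: List[List[int]], term_index: int) -> int:
    """Counts the documents in which the term occurs at least once (n_t)."""
    return sum(1 for row in matrix if row[term_index] > 0)


def toy_idf(matrix: List[List[int]], term_index: int) -> float:
    """Computes log10(N / n_t) for one term (hypothetical helper)."""
    n_docs = len(matrix)
    n_t = document_frequency(matrix, term_index)
    if n_t == 0:
        # Assumption: treat a term that never occurs as an error.
        raise ValueError("term does not occur in any document")
    return math.log10(n_docs / n_t)


print([round(toy_idf(TOY_MATRIX, t), 4) for t in range(3)])
# [0.301, 0.6021, 0.1249]
```

Note that $n_t$ counts documents, not occurrences: term 2 occurs five times in total but in only three documents, so its IDF is $\log_{10}(4/3)$.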

In [None]:
def get_term_idf(doc_term_matrix: List[List[int]], term_index: int) -> float:
    """Computes the IDF value of a term, given by its index, based on a document-term matrix."""
    # TODO Complete this method.
    return 0

Tests.

In [None]:
%%run_pytest[clean]

def test_idf_term0():
    assert get_term_idf(DOC_TERM_MATRIX, 0) == pytest.approx(0.3979, rel=1e-3)
    
def test_idf_term2():
    assert get_term_idf(DOC_TERM_MATRIX, 2) == pytest.approx(0.0969, rel=1e-3)

## Task 3: TF-IDF weighting

Compute the TF-IDF vector for a given document, where the TF-IDF weight of a term in a document is given by:

$$ tfidf_{t,d} = tf_{t,d} \times idf_{t}$$
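
To make the product concrete, the sketch below combines both factors on a small toy matrix (not the exercise data), using base-10 logarithms. The names `TOY_MATRIX` and `toy_tfidf` are hypothetical, and this is only one possible way to organize the computation.

```python
import math
from typing import List

# Toy matrix: 4 documents (rows) x 3 terms (columns).
TOY_MATRIX = [
    [2, 0, 1],
    [0, 0, 3],
    [1, 1, 0],
    [0, 0, 1],
]


def toy_tfidf(matrix: List[List[int]], doc_index: int) -> List[float]:
    """Computes tf * idf for every term of one document (hypothetical helper)."""
    counts = matrix[doc_index]
    doc_length = sum(counts)  # |d|
    n_docs = len(matrix)      # N
    weights = []
    for term_index, count in enumerate(counts):
        if count == 0:
            # tf is 0, so the product is 0 regardless of the IDF value.
            weights.append(0.0)
            continue
        # The term occurs in this document, so n_t >= 1 and doc_length > 0.
        n_t = sum(1 for row in matrix if row[term_index] > 0)
        weights.append((count / doc_length) * math.log10(n_docs / n_t))
    return weights


print([round(w, 4) for w in toy_tfidf(TOY_MATRIX, 0)])
# [0.2007, 0.0, 0.0416]
```

For document 0, term 0 has $tf = 2/3$ and $idf = \log_{10}(4/2)$, giving about 0.2007, while term 2 has $tf = 1/3$ and $idf = \log_{10}(4/3)$, giving about 0.0416.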

In [None]:
def get_tfidf_vector(doc_term_matrix: List[List[int]], doc_index: int) -> List[float]:
    """Computes the TF-IDF vector of a document, given by its index, based on a document-term matrix."""
    # TODO Complete this method.
    return [0] * len(doc_term_matrix[doc_index])

Tests.

In [None]:
%%run_pytest[clean]

def test_tfidf_doc0():
    assert get_tfidf_vector(DOC_TERM_MATRIX, 0) == pytest.approx([0, 0, 0.0969, 0, 0, 0], rel=1e-3)