# Testing the Quality of the tf_idf() Metric

In this notebook, we will use the pre-generated
evaluation sets to test the quality of the `tf_idf()`
metric.

The `tf_idf()` metric computes the average tf-idf
score over all words in the lyrics. Tf-idf stands
for *term frequency - inverse document frequency*;
the tf-idf score of a word measures how important
that word is to a single document relative to
the whole corpus.
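
To make the idea concrete, here is a minimal
pure-Python sketch of a textbook tf-idf variant
(relative term frequency times the natural log of
the inverse document frequency), averaged over the
word occurrences of one document. The function names
`tf_idf_scores` and `average_tf_idf` and the toy
corpus are illustrative assumptions; the `tf_idf()`
metric itself relies on `gensim`, which applies its
own weighting and normalization, so the numbers
will differ.

```python
import math
from collections import Counter


def tf_idf_scores(document, corpus):
    """Return {word: tf-idf} for one tokenized document (hypothetical helper)."""
    n_docs = len(corpus)
    # Document frequency: number of corpus documents that contain each word.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    term_freq = Counter(document)
    return {
        word: (count / len(document)) * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
        if doc_freq[word] > 0  # skip words the corpus has never seen
    }


def average_tf_idf(document, corpus):
    """Average the tf-idf score over every scored word occurrence."""
    scores = tf_idf_scores(document, corpus)
    known = [scores[word] for word in document if word in scores]
    return sum(known) / len(known) if known else 0.0


# Toy corpus: each "song" is a list of tokens.
corpus = [
    ["love", "me", "do"],
    ["love", "is", "all"],
    ["all", "you", "need", "is", "love"],
]
print(tf_idf_scores(corpus[0], corpus))
print(average_tf_idf(corpus[0], corpus))
```

In this toy corpus, "love" appears in every song,
so its score is zero, while the rarer words "me"
and "do" carry all of the weight.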

We use the entire song dataset as the corpus
from which the scores are computed. We built a
`TfidfModel` with the `gensim` library and saved
it to disk so that it does not have to be
regenerated every time it is needed.

The evaluation test sets were generated using
the `_generate_metric_quality_sets()` method
from the `helpers` submodule. We will use three
sets of unmodified lyrics and, for each removal
probability in {0.1, 0.5, 0.8}, three sets of
lyrics with words randomly removed anywhere on
the line and three sets of lyrics with words
randomly removed at the end of the line. Each
test set contains 1000 songs.
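
The two kinds of modification can be sketched as
follows. This is only one possible reading of the
description above, not the code of
`_generate_metric_quality_sets()`: the function
name `remove_words`, its parameters, and the exact
removal rule (each word dropped independently, or
only the last word of a line dropped) are
assumptions made for illustration.

```python
import random


def remove_words(lines, probability, last_only=False, seed=None):
    """Drop words from lyric lines with the given probability (hypothetical).

    last_only=False: every word is dropped independently with `probability`.
    last_only=True: only the final word of each line may be dropped.
    """
    rng = random.Random(seed)  # a seed makes the modification reproducible
    result = []
    for line in lines:
        words = line.split()
        if last_only:
            if words and rng.random() < probability:
                words = words[:-1]
        else:
            words = [word for word in words if rng.random() >= probability]
        result.append(" ".join(words))
    return result


lyrics = ["hello darkness my old friend", "I have come to talk with you again"]
print(remove_words(lyrics, 0.5, seed=0))
print(remove_words(lyrics, 0.5, last_only=True, seed=0))
```

A probability of 0 leaves the lyrics untouched,
and higher probabilities damage the text more,
which is what lets the test sets be ranked by
expected quality.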

To determine the quality of the metric, we will
compute the average score of each test set and
compare the observed ordering of these averages
with the ordering we expect. For further
explanation, see the notebook
`05_test_rhyme_metric_quality.ipynb`.
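
One way such a comparison could be scored is as
the fraction of pairs of test sets, taken from
different groups, whose average scores appear in
the expected order. The sketch below illustrates
that idea only; `ordering_quality` is a
hypothetical name, and the assumption that groups
are listed from lowest to highest expected score
is ours. The actual logic lives in
`calculate_metric_quality()` and may differ.

```python
from itertools import combinations
from statistics import mean


def ordering_quality(groups):
    """Fraction of cross-group set pairs whose averages follow the expected order.

    `groups` is ordered from lowest to highest expected score (an assumption).
    Each group is a list of test sets; each test set is a list of
    per-song metric scores.
    """
    averages = [[mean(test_set) for test_set in group] for group in groups]
    correct = total = 0
    for lower_group, higher_group in combinations(averages, 2):
        for low in lower_group:
            for high in higher_group:
                total += 1
                correct += low < high  # True counts as 1
    return correct / total if total else 0.0


groups = [
    [[0.1, 0.1], [0.2, 0.2]],    # expected to score lower
    [[0.3, 0.3], [0.15, 0.15]],  # expected to score higher
]
print(ordering_quality(groups))  # 0.75
```

Under this reading, three groups of three sets
yield 27 comparisons and seven groups of three
sets yield 189, so a score of 1.0 means every
comparison matched the expected order and a score
near 0 means almost none did.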

In [1]:
import gensim
import functools
import lyrics_analysis

evaluation_sets = [
    [
        "1000_removed_all_words0.8_1.json",
        "1000_removed_all_words0.8_2.json",
        "1000_removed_all_words0.8_3.json"
    ],
    [
        "1000_removed_last_words0.8_1.json",
        "1000_removed_last_words0.8_2.json",
        "1000_removed_last_words0.8_3.json"
    ],
    [
        "1000_removed_all_words0.5_1.json",
        "1000_removed_all_words0.5_2.json",
        "1000_removed_all_words0.5_3.json"
    ],
    [
        "1000_removed_last_words0.5_1.json",
        "1000_removed_last_words0.5_2.json",
        "1000_removed_last_words0.5_3.json"
    ],
    [
        "1000_removed_all_words0.1_1.json",
        "1000_removed_all_words0.1_2.json",
        "1000_removed_all_words0.1_3.json"
    ],
    [
        "1000_removed_last_words0.1_1.json",
        "1000_removed_last_words0.1_2.json",
        "1000_removed_last_words0.1_3.json"
    ],
    [
        "1000_id0.1_1.json",
        "1000_id0.1_2.json",
        "1000_id0.1_3.json"
    ]
]

directory = "../data/metric_quality_tests/"

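# One generator per file, grouped in the same order as evaluation_sets.
# (`group` avoids shadowing the built-in `set`.)
eval_generators = [
    [
        lyrics_analysis.helpers._get_generator_from_file(directory + filename)
        for filename in group
    ]
    for group in evaluation_sets
]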


Now we load the saved dictionary and tf-idf model.

In [2]:
dict = gensim.corpora.Dictionary.load("../models/tfidf.dict")
model = gensim.models.TfidfModel.load("../models/tfidf.model")

To test the metric, we will use the `calculate_metric_quality()`
method from the `tests` submodule.

In [3]:
score = lyrics_analysis.tests.calculate_metric_quality(
    functools.partial(lyrics_analysis.evaluation.tf_idf, dictionary=dict, tfidf=model),
    eval_generators
)

print(score)


0.10582010582010581


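We see that the metric performs very poorly,
but that might be because we defined the expected
order incorrectly. To check this, we evaluate the
two removal strategies separately: first the sets
with words removed at the end of the line, then
the sets with words removed anywhere on the line.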

In [4]:
eval_sets1 = [
    [
        "1000_removed_last_words0.8_1.json",
        "1000_removed_last_words0.8_2.json",
        "1000_removed_last_words0.8_3.json"
    ],
    [
        "1000_removed_last_words0.5_1.json",
        "1000_removed_last_words0.5_2.json",
        "1000_removed_last_words0.5_3.json"
    ],
    [
        "1000_removed_last_words0.1_1.json",
        "1000_removed_last_words0.1_2.json",
        "1000_removed_last_words0.1_3.json"
    ]
]

directory = "../data/metric_quality_tests/"

# Rebuild the generators for the end-of-line removal sets only.
eval_generators = [
    [
        lyrics_analysis.helpers._get_generator_from_file(directory + filename)
        for filename in group
    ]
    for group in eval_sets1
]

score = lyrics_analysis.tests.calculate_metric_quality(
    functools.partial(lyrics_analysis.evaluation.tf_idf, dictionary=dict, tfidf=model),
    eval_generators
)

print(score)

0.07407407407407407


In [5]:
eval_sets2 = [
    [
        "1000_removed_all_words0.8_1.json",
        "1000_removed_all_words0.8_2.json",
        "1000_removed_all_words0.8_3.json"
    ],
    [
        "1000_removed_all_words0.5_1.json",
        "1000_removed_all_words0.5_2.json",
        "1000_removed_all_words0.5_3.json"
    ],
    [
        "1000_removed_all_words0.1_1.json",
        "1000_removed_all_words0.1_2.json",
        "1000_removed_all_words0.1_3.json"
    ]
]

directory = "../data/metric_quality_tests/"

# Rebuild the generators for the anywhere-on-the-line removal sets only.
eval_generators = [
    [
        lyrics_analysis.helpers._get_generator_from_file(directory + filename)
        for filename in group
    ]
    for group in eval_sets2
]

score = lyrics_analysis.tests.calculate_metric_quality(
    functools.partial(lyrics_analysis.evaluation.tf_idf, dictionary=dict, tfidf=model),
    eval_generators
)

print(score)



0.0
