# Test the Quality of the rhymes() Metric

The goal of this notebook is to see how well the rhymes()
metric from the lyrics_analysis module performs. To test
this, we will check how well it ranks different
evaluation sets.

The idea is that while it might not rank individual songs
in the same order as a human would, it should perform
similarly on average. For example, one might expect that
actual song lyrics will get a higher score on average
than, for example, random texts.

We will test the quality of our metric this way. We will
have three sets of 1000 songs, three sets of 1000 lyrics
with last words on lines randomly replaced by other words,
and three sets of 1000 songs with last words on line
randomly removed.

We will not compare the sets from the same category,
since they are expected to have a similar average score.
Instead, we will check how many of the expected relations
are preserved: all sets of actual songs should be scored
higher than all other sets and all sets with replaced
words should be scored higher than the sets with removed
words. This gives us 27 expected ordered pairs. The
number we will get says what proportion of these 27 pairs
was preserved after running the metric.

### Generating the data

First, we will need to generate the evaluation sets.
In the lyrics_analysis.sampler submodule, we have the
function sample_n_songs_from_generator, which will come
in handy now.

We will use the entire dataset to select random songs.
Some of them will be saved as-is, others will be modified
so that they fit our needs.

In [1]:
# imports
import json
import ijson
import functools

import lyrics_analysis

                the kernel may be left running.  Please let us know
                about your system (bitness, Python, etc.) at
                ipython-dev@scipy.org
  ipython-dev@scipy.org""")


In [2]:
# define the generator that will yield JSON song data
def generator():
    with open("../data/cleaned/song_lyrics_english_only.json") as file:
        for item in ijson.items(file, "item"):
            yield item
            

In [3]:
# set up what we want to generate
funcs = [
    lambda x: x,
    lyrics_analysis.modifications.remove_last_word_on_line,
    functools.partial(lyrics_analysis.modifications.replace_last_words_on_line, words=["porcupine", "armadillo", "anteater"])
]

func_names = [
    "_original",
    "_replaced_words",
    "_removed_words"
]

path = "../data/metric_quality_tests/"
prefix = "eval_set_"

# generate the files
for i in range(3):
    for func, func_name in zip(funcs, func_names):
        dest = path + prefix + func_name + str(i+1) + ".json"
        examples = []
        for example in lyrics_analysis.sampler.sample_n_songs_from_generator(1000, generator, total_n=100000):
            example["lyrics"] = func(example["lyrics"])
            examples.append(example)
        with open(dest, 'w') as out_file:
            json.dump(examples, out_file)
        print("Finished set #%s of %s" % (i+1, func_name))

Finished set #1 of _original
Finished set #1 of _replaced_words
Finished set #1 of _removed_words
Finished set #2 of _original
Finished set #2 of _replaced_words
Finished set #2 of _removed_words
Finished set #3 of _original
Finished set #3 of _replaced_words
Finished set #3 of _removed_words


### Testing the metric

Now that we have our evaluation sets ready, we can proceed
to testing the metric itself. In the lyrics_analysis.tests
submodule, there is the function `calculate_metric_quality`
that performs the test as described above. It takes a
reference to the metric and a list of lists of song
generators as parameters. The lists of generators have
to be sorted from the lowest expected score.

We need to create the song generators from the evaluation
sets. For this, we will use the `_get_generator_from_file`
helper function.

In [8]:
evaluation_sets = [
    ["eval_set__removed_words1.json",
     "eval_set__removed_words2.json",
     "eval_set__removed_words3.json"],
    ["eval_set__replaced_words1.json",
     "eval_set__replaced_words2.json",
     "eval_set__replaced_words3.json"],
    ["eval_set__original1.json",
     "eval_set__original2.json",
     "eval_set__original3.json"]
]
directory = "../data/metric_quality_tests/"

eval_genarators = [
    [
        lyrics_analysis.helpers._get_generator_from_file(directory + file) for file in set
    ]
    for set in evaluation_sets
]

Finally, we can test the metric.

In [9]:
score = lyrics_analysis.tests.calculate_metric_quality(
    lyrics_analysis.evaluation.rhymes,
    eval_genarators
)

print(score)

0.8518518518518519


We see that in 85% of the cases, the metric ranked the pair as expected.