## Comparing the proportion of rhyming lines

In this notebook, we will use the rhymes() metric defined in
lyrics_analysis.evaluation to compare the proportion of rhyming lines.
We will make two comparisons: first, between individual genres, and
second, between actual song lyrics and lyrics with randomly shuffled
lines.

The rhymes() metric calculates the proportion of lines that rhyme
with the previous line. This means that a song where the first line
rhymes with the second, the third with the fourth, and so on would get
a score of 0.5, because only every second line rhymes with the line
before it.
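
The scoring rule described above can be illustrated with a small toy version. This is only a sketch, not the implementation in lyrics_analysis.evaluation: the name toy_rhymes is hypothetical, it assumes the lyrics are a newline-separated string, and it compares the last letters of each line's final word instead of phonemes.

```python
def toy_rhymes(lyrics, rhyme_level=2):
    """Toy stand-in for rhymes(): compares letters, not phonemes.

    Returns the proportion of lines whose last word ends in the same
    `rhyme_level` letters as the last word of the previous line.
    """
    # Split into non-empty lines, each represented as a list of words.
    lines = [line.split() for line in lyrics.splitlines() if line.strip()]
    if not lines:
        return 0.0
    # Take the ending of the last word of every line, ignoring punctuation.
    endings = [
        words[-1].lower().strip(".,!?;:")[-rhyme_level:] for words in lines
    ]
    # Count the lines that match the line directly before them.
    hits = sum(1 for prev, cur in zip(endings, endings[1:]) if prev == cur)
    return hits / len(lines)


# Lines 1-2 and lines 3-4 rhyme, so 2 of the 4 lines match their predecessor.
couplets = "I saw a cat\nIt wore a hat\nIt ran all day\nThen went away"
score = toy_rhymes(couplets)
```

Dividing by the total number of lines, rather than by the number of adjacent pairs, is what makes a song written entirely in couplets score 0.5.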

The rhymes() function takes two parameters: song lyrics and rhyme_level,
which is an integer indicating how many phonemes have to be identical
for the words to be considered rhymes. The default is 2.

In [1]:
import ijson
import lyrics_analysis.evaluation
import matplotlib.pyplot as plt
%matplotlib inline

Define a generator that yields each song's lyrics and genre
from a JSON file.

In [2]:
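def retrieve_lyrics_and_genre(file):
    """Yield (lyrics, genre) pairs for every song in a JSON file."""
    with open(file) as f:
        # ijson streams the top-level array item by item, so the
        # whole file never has to be loaded into memory at once.
        songs = ijson.items(f, 'item')
        for song in songs:
            yield song["lyrics"], song["genre"]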

Define a dictionary that will store the scores.

In [3]:
scores_by_genre = {
    "rap": [],
    "pop": [],
    "rock": [],
    "r-b": [],
    "country": []
}

Now, parse the randomly selected set of n=10,000 songs and calculate
the proportion of rhymes for each one.

In [None]:
# TODO: this will have to be improved because 10,000 songs is way too many 
for lyrics, genre in retrieve_lyrics_and_genre("../data/cleaned/eval_set_10000_lyrics.json"):
    score = lyrics_analysis.evaluation.rhymes(lyrics)
    scores_by_genre[genre].append(score)
    


Let's look at the average score for each genre.

In [None]:
for genre, scores in scores_by_genre.items():
    # Skip genres with no songs to avoid dividing by zero.
    if scores:
        print(genre, sum(scores) / len(scores))
    

Now, let's plot a histogram to see whether the score distributions
differ noticeably between genres.

In [None]:
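n_bins = 20
genres = ["rap", "pop", "rock", "r-b", "country"]
# One list of scores per genre; plt.hist draws them as side-by-side bars.
data = [scores_by_genre[genre] for genre in genres]
plt.hist(data, bins=n_bins, label=genres)
plt.legend()
plt.xlabel("Proportion of rhyming lines")
plt.ylabel("Number of songs")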
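
For the second comparison mentioned in the introduction, actual lyrics versus lyrics with randomly shuffled lines, a shuffled copy of each song is needed. The helper below is a minimal sketch of one way to produce it. The name shuffle_lines and its seed parameter are hypothetical, and it assumes the lyrics are stored as a single newline-separated string.

```python
import random


def shuffle_lines(lyrics, seed=None):
    """Return the lyrics with their lines in random order.

    Assumes `lyrics` is one string with lines separated by newlines.
    Passing a seed makes the shuffle reproducible.
    """
    rng = random.Random(seed)
    lines = lyrics.splitlines()
    rng.shuffle(lines)  # in-place shuffle of the list of lines
    return "\n".join(lines)


original = "line one\nline two\nline three\nline four"
shuffled = shuffle_lines(original, seed=0)
```

The shuffled text contains exactly the same lines as the original, so any drop in the score comes only from the change in line order.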
