We will be working with a dataset of about 600,000 song lyrics with metadata, stored in the file `../data/song_lyrics.json`. For testing, I created the files `song_lyrics10.json`, `song_lyrics1000.json` and `song_lyrics10000.json`, which contain semi-randomly picked subsets of 10, 1,000 and 10,000 songs, respectively.
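
How the subsets were picked is not shown here. One common way to draw a fixed-size random subset from a stream that does not fit in memory is reservoir sampling. The sketch below is only an assumption about how such test files could be produced, not the procedure actually used; `reservoir_sample` and `fake_songs` are hypothetical names introduced for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream.

    Only k items are held in memory at any time, so the stream
    itself can be arbitrarily long.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    sample: List[T] = []
    for index, item in enumerate(stream):
        if index < k:
            # fill the reservoir with the first k items
            sample.append(item)
        else:
            # replace a random slot with probability k / (index + 1)
            j = rng.randint(0, index)
            if j < k:
                sample[j] = item
    return sample


# stand-in for the real song objects: a generator of small dicts
fake_songs = ({"id": i} for i in range(1000))
subset = reservoir_sample(fake_songs, k=10)
print(len(subset))
```

In practice, the stream could be any iterator over song objects, and the resulting list could be written out with `json.dump`.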

Since the dataset is too large to load into memory at once, we will use the `ijson` module, which parses the file incrementally and returns a generator that yields the objects one at a time.

In [1]:
import ijson

In [2]:
# binary mode: ijson works on bytes
with open("../data/song_lyrics10.json", "rb") as file:
    # the 'item' prefix selects each element of the top-level array
    # individually, rather than the whole array at once
    objects = ijson.items(file, "item")

    # the 10-song test file is small enough to materialize as a list
    songs = list(objects)

Let's look at what data the objects contain.

In [3]:
from pprint import pprint

In [4]:
pprint(songs[0])

{'album': 'The Mountain',
 'artist': 'Haken',
 'charts': 'N/A',
 'featured_artist': [],
 'genre': 'rock',
 'has_featured_annotation': 'false',
 'has_featured_video': 'false',
 'has_verified_callout': 'false',
 'is_music': 'true',
 'lyrics': ['Hearts will burn come what may  ',
            'With lessons learned along the way  ',
            'To free myself I make a choice  ',
            'Just to be heard I lose my voice  ',
            '  ',
            'Finding strength in solitude  ',
            'I fight to fly with much to prove  ',
            "Is this the way it's meant to be?  ",
            'I risk it all I will not fall  ',
            '  ',
            'Chorus  ',
            'Carry the weight of the world  ',
            'On my shoulders  ',
            'Rise to the challenge I set myself  ',
            '  ',
            'Solo  ',
            '  ',
            'Salvation waits without reprieve  ',
            "I'm on a razor's edge and it cuts my feet  ",
            'As go

Let's see how many artists and genres we have.

In [5]:
stats = {
    "genres": {},
    "authors": {}
}

with open("../data/song_lyrics10.json", "rb") as file:
    objects = ijson.items(file, "item")

    # named `song` to avoid shadowing the built-in `object`
    for song in objects:
        genre = song["genre"]
        artist = song["artist"]

        # increment the counters, starting from 0 for unseen keys
        stats["genres"][genre] = stats["genres"].get(genre, 0) + 1
        stats["authors"][artist] = stats["authors"].get(artist, 0) + 1

pprint(stats)

{'authors': {'All Levels At Once': 1,
             'Apollo 3': 1,
             'Arthur H': 1,
             'Dying Passion': 1,
             'Haken': 1,
             'Halestorm': 1,
             'Haley Bonar': 1,
             'Kylie': 1,
             'Nuwance': 1,
             'Ywis': 1},
 'genres': {'pop': 9, 'rock': 1}}
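
The same counting pattern can also be written with `collections.Counter` from the standard library, which removes the need for the `.get(key, 0) + 1` idiom. This is a minimal sketch on a small hand-written list, `sample_songs` (a hypothetical stand-in for the streamed objects), rather than a replacement for the cell above.

```python
from collections import Counter

# tiny in-memory stand-in for the objects yielded by the parser
sample_songs = [
    {"artist": "Haken", "genre": "rock"},
    {"artist": "Kylie", "genre": "pop"},
    {"artist": "Halestorm", "genre": "pop"},
]

genre_counts = Counter()
artist_counts = Counter()

for song in sample_songs:
    # missing keys start at 0 automatically
    genre_counts[song["genre"]] += 1
    artist_counts[song["artist"]] += 1

# most_common() sorts entries by count, highest first
print(genre_counts.most_common())
print(len(artist_counts), "distinct artists")
```

`most_common(n)` is convenient on the full dataset, where printing every artist would be impractical.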


In addition, we want to track the average number of lines per song, words per song, and words per line.

In [6]:
with open("../data/song_lyrics10.json", "rb") as file:
    objects = ijson.items(file, "item")

    # running totals; the *_count names keep the `songs` list
    # from the earlier cell from being overwritten
    song_count = 0
    line_count = 0
    word_count = 0

    for song in objects:
        song_count += 1
        # every entry in "lyrics" counts as a line, including blank ones
        line_count += len(song["lyrics"])
        word_count += sum(len(line.split()) for line in song["lyrics"])

print("average number of lines in a song:", line_count / song_count)
print("average number of words in a song:", word_count / song_count)
print("average number of words in a line:", word_count / line_count)

average number of lines in a song: 43.8
average number of words in a song: 223.7
average number of words in a line: 5.107305936073059
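
Note that the line count above includes blank separator lines and section markers such as 'Chorus' or 'Solo', which lowers the words-per-line average. The sketch below works through the arithmetic on four lines taken from the sample song shown earlier. Skipping blank lines is only one possible refinement, and `non_blank_count` is a name introduced here for illustration.

```python
# four entries from the "lyrics" list of the sample song, trailing spaces included
lyrics = [
    "Carry the weight of the world  ",
    "On my shoulders  ",
    "  ",
    "Solo  ",
]

line_count = len(lyrics)
# str.split() with no arguments ignores leading/trailing whitespace,
# so a whitespace-only line contributes zero words
word_count = sum(len(line.split()) for line in lyrics)
# lines that contain at least one non-whitespace character
non_blank_count = sum(1 for line in lyrics if line.strip())

print("words per line (all lines):", word_count / line_count)
print("words per line (non-blank lines):", word_count / non_blank_count)
```

Here the four lines hold 10 words, so the average is 2.5 words per line when the blank line is counted and about 3.33 when it is skipped.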
