# Data Preprocessing

In this notebook, we demonstrate our data query package, implemented in `metalhistory/data_query_api`.
The package handles the API calls to LastFM and MusicBrainz, from which we obtain information about the albums we are interested in.


Start with the imports...


In [1]:
import requests
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import metalhistory.data_query_functions as dqf

In [2]:
# Use the pandas extension of tqdm for pretty progress bars (enables progress_apply)
tqdm.pandas()

# Data Preprocessing
Read in a CSV file that has the following structure:

__id__|__artist__|__album__|__MA_Score__|

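Where 'artist' refers to an artist's or band's name, 'album' refers to the name of an album release, and 'MA_Score' refers to the overall rating on Metal Archives. In the file loaded below, these fields appear to correspond to the columns `Nr.`, `Band`, `Album`, and `MA`.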


In [12]:
df_csv = pd.read_csv('data/HoM.csv')
df_csv = df_csv.dropna(axis=1, how='all').drop('Unnamed: 16', axis=1)
df_csv

Unnamed: 0,Nr.,Band,Album,Jahr,Monat,Tag,MA
0,1,311,Evolver,2003,7.0,22,
1,2,311,From Chaos,2001,6.0,19,
2,3,311,Mosaic,2017,6.0,23,
3,4,311,Soundsystem,1999,10.0,12,
4,5,1349,Beyond the Apocalypse,2004,4.0,19,9.73
...,...,...,...,...,...,...,...
9355,9356,Zhrine,Unortheta,,,,4.43
9356,9357,Znöwhite,Act of God,,,,8.44
9357,9358,Zonaria,Infamy and the Breed,,,,3.89
9358,9359,Zyklon,World ov Worms,,,,4.03


In [17]:
print(9360/120)

78.0


In [18]:
df_csv.head(9360).tail(120)

Unnamed: 0,Nr.,Band,Album,Jahr,Monat,Tag,MA
9240,9241,Wolfheart,Wolves of Karelia,,,,3.43
9241,9242,Wolves in the Throne Room,Black Cascade,,,,11.63
9242,9243,Wolves in the Throne Room,Celestial Lineage,,,,7.40
9243,9244,Wolves in the Throne Room,Diadem of 12 Stars,,,,10.67
9244,9245,Wolves in the Throne Room,Malevolent Grain,,,,5.82
...,...,...,...,...,...,...,...
9355,9356,Zhrine,Unortheta,,,,4.43
9356,9357,Znöwhite,Act of God,,,,8.44
9357,9358,Zonaria,Infamy and the Breed,,,,3.89
9358,9359,Zyklon,World ov Worms,,,,4.03
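
The `head(...).tail(...)` call above selects one block of 120 rows. The following is a small sketch, on a toy table with made-up values, of how this idiom pages through a table and how it compares with `iloc` slicing. The names `toy` and `CHUNK` are illustrative assumptions and do not appear in the project.

```python
import numpy as np
import pandas as pd

# Toy table with 10 rows (made-up data, not the album list)
toy = pd.DataFrame({"value": range(10)})

CHUNK = 4  # chunk size chosen for illustration only

# Upper bound of each chunk: 4, 8, 12
bounds = np.arange(0, len(toy) + CHUNK, CHUNK)[1:]

head_tail_chunks = []
iloc_chunks = []
for stop in bounds:
    start = stop - CHUNK
    # Same idiom as in the notebook: first `stop` rows, then the last CHUNK of those
    head_tail_chunks.append(list(toy.head(stop).tail(CHUNK).index))
    # Positional slice covering rows start .. stop-1
    iloc_chunks.append(list(toy.iloc[start:stop].index))

print(head_tail_chunks)
print(iloc_chunks)
```

Both approaches agree as long as the table length is a multiple of the chunk size. When it is not, the last `head/tail` chunk reaches back into rows that were already part of the previous chunk, while the `iloc` slice simply returns the shorter remainder.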


## Extract album information from LastFM and MusicBrainz

Now we use the LastFM and MusicBrainz APIs to collect more information about each album.
The cell below works through the table in chunks of 120 rows and writes each chunk to its own CSV file.

*Note: Running it on the full list of roughly 10,000 albums takes a long time (about 1.5 hours in the run shown here), because the APIs throttle large numbers of requests sent in a short period.*
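
The output of the query cell shows messages of the form "Response code 503. Waiting for N seconds.", which suggests that the package pauses and retries when the server is temporarily unavailable. The sketch below shows one plausible way to do this; it is not necessarily how `data_query_functions` is implemented. The function `call_with_retry`, its parameters, and the fake endpoint are hypothetical and exist only for illustration.

```python
import random
import time


def call_with_retry(fetch, max_tries=5, min_wait=2, max_wait=5, sleep=time.sleep):
    """Call fetch() again while it keeps answering with status 503.

    fetch is any zero-argument callable returning (status_code, payload).
    Hypothetical helper, not part of the project's package.
    """
    status, payload = fetch()
    tries = 1
    while status == 503 and tries < max_tries:
        # Wait a random number of seconds before asking again
        wait = random.randint(min_wait, max_wait)
        print(f"Response code {status}. Waiting for {wait} seconds.")
        sleep(wait)
        status, payload = fetch()
        tries += 1
    # Either a non-503 answer, or the last 503 after max_tries attempts
    return status, payload


# Usage with a fake endpoint that fails twice before succeeding;
# sleeping is replaced by a no-op so the example finishes immediately
responses = iter([(503, None), (503, None), (200, {"name": "Example"})])
result = call_with_retry(lambda: next(responses), sleep=lambda seconds: None)
print(result)
```

Passing the request as a callable keeps the retry logic independent of any particular API, so it can be tried out without network access.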

In [29]:
%%time
lastfm = dqf.LastFM()

# Fields requested from LastFM for every album
FIELDS = ['artist', 'name', 'release-date', 'listeners', 'playcount', 'tags', 'mbid', 'url', 'image']

# Process the table in chunks of TAIL rows and save each chunk separately
TAIL = 120
for HEAD in np.arange(0, len(df_csv)+TAIL, TAIL)[1:]:
    start = HEAD-TAIL
    stop = HEAD
    print(start, '-', stop)

    # iloc keeps the last chunk from overlapping the previous one;
    # .copy() avoids a SettingWithCopyWarning when the new column is added
    result_df = df_csv.iloc[start:stop].copy()
    result_df['lastfm_info'] = result_df.progress_apply(lambda row: lastfm.get_album_info(artist=row['Band'], album=row['Album'], fields=FIELDS), axis=1)

    # Expand each lastfm_info entry into separate columns
    result_df = pd.concat([result_df, result_df.lastfm_info.apply(pd.Series)], axis=1)
    result_df = result_df.drop('lastfm_info', axis=1)
    # result_df.rename(columns={'name':'lastfm_album'}, inplace=True)

    result_df.to_csv('./data/processed/proc_MA_'+ str(start) + '-' + str(stop) + '_albums.csv')

0 - 120
Response code 503. Waiting for 2 seconds.
Response code 503. Waiting for 4 seconds.
Response code 503. Waiting for 4 seconds.
Response code 503. Waiting for 4 seconds.
Response code 503. Waiting for 5 seconds.
Response code 503. Waiting for 4 seconds.
120 - 240
Response code 503. Waiting for 4 seconds.
Response code 503. Waiting for 4 seconds.
Response code 503. Waiting for 3 seconds.
Response code 503. Waiting for 4 seconds.
Response code 503. Waiting for 3 seconds.
...
4680 - 4800
JSONDecodeError while querying for Kinstrife & Blood On Paths Long Forgotten...
Response code 503. Waiting for 3 seconds.
...
9120 - 9240
Response code 503. Waiting for 2 seconds.
Response code 503. Waiting for 2 seconds.
9240 - 9360

CPU times: user 3min 47s, sys: 13.6 s, total: 4min
Wall time: 1h 35min


## Preview the preprocessed data

Let's display some of the information we gathered about the metal albums ranked at positions 76-100 of the top heavy metal albums in history.
We will take a look at the fields:

|__artist__|__album__|__release-date__|__listeners__|__playcount__|__tags__|__ignored tags__|

In [6]:
SHOW_FIELDS = ['MA_artist', 'release-date', 'listeners', 'playcount', 'tags', 'ignored tags']
result_df[SHOW_FIELDS]

artist,album,MA_artist,release-date,listeners,playcount,tags,ignored tags
Slayer,Reign in Blood,Slayer,1986-10-07,826535,15664417,"[thrash metal, speed metal, heavy metal]","[albums i own, metal]"
Metallica,Kill 'Em All,Metallica,1983-07-25,583516,14405386,"[thrash metal, heavy metal, speed metal]","[albums i own, metal]"
,,Hades Archer,,,,,
Iron Maiden,Iron Maiden,Iron Maiden,1980-04-14,393721,7110933,[heavy metal],"[albums i own, nwobhm, metal, 1980]"
Metallica,Master of Puppets,Metallica,2003,952876,22537859,"[thrash metal, heavy metal]","[albums i own, metal, favourite albums]"
...,...,...,...,...,...,...,...
,,Ulver,,,,,
Suffocation,Pierced From Within,Suffocation,1995-05-23,53966,992878,"[death metal, technical death metal]","[brutal death metal, 1995, albums i own]"
Black Sabbath,Born Again,Black Sabbath,1983-08-07,84917,1224192,[heavy metal],"[hard rock, 1983, classic rock, albums i own]"
,,Cradle of Filth,,,,,


## Export
We can now export the preprocessed dataframe to a CSV file.

In [6]:
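# HEAD and TAIL still hold the values from the last loop iteration,
# so this exports the most recent chunk, sorted by listener count
start = HEAD-TAIL
stop = HEAD
result_df.sort_values(by='listeners', ascending=False).to_csv('./data/proc_MA_'+ str(start) + '-' + str(stop) + '_albums.csv')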
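
Because the query step writes one CSV file per chunk, the chunks need to be stacked again before the whole list can be analyzed. The sketch below illustrates one way to do this with `glob` and `pd.concat`. It writes two small stand-in files with made-up rows to a temporary folder so that it runs on its own; the folder, file contents, and variable names are assumptions for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two small stand-in chunk files to a temporary folder
# (made-up rows; the real chunks are written to ./data/processed/)
folder = tempfile.mkdtemp()
pd.DataFrame({"Band": ["A", "B"], "listeners": [10, 30]}).to_csv(
    os.path.join(folder, "proc_MA_0-2_albums.csv"))
pd.DataFrame({"Band": ["C"], "listeners": [20]}, index=[2]).to_csv(
    os.path.join(folder, "proc_MA_2-4_albums.csv"))

# Read every chunk file and stack them into one table
paths = sorted(glob.glob(os.path.join(folder, "proc_MA_*_albums.csv")))
combined = pd.concat([pd.read_csv(path, index_col=0) for path in paths])

# Sort the combined table by listener count, as in the export cell
combined = combined.sort_values(by="listeners", ascending=False)
print(combined)
```

Note that `sorted` orders the file names alphabetically rather than numerically; this does not matter here because the combined table is sorted by `listeners` afterwards.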

# Next Steps
Now that we have exported our data to a CSV file, we can use our visualization functions to further explore the history of heavy metal. Follow the link below:


[1-visualizations.ipynb](1-visualizations.ipynb)