# Data Preprocessing

In this notebook, we demonstrate our data query package, implemented in `metalhistory/data_query_api`.
The package takes care of the API calls to LastFM and MusicBrainz, from which we obtain information about the albums we're interested in.


Start with the imports...


In [1]:
import requests
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import metalhistory.data_query_functions as dqf

In [2]:
#Use the pandas extension of tqdm for pretty progress bars
tqdm.pandas()

# Data Preprocessing
Read in a CSV file that has the following structure:

__id__|__artist__|__album__|__MA_score__|

Here, 'artist' is the name of an artist or band, 'album' is the name of an album release, and 'MA_score' is the album's overall rating on Metal Archives.


In [3]:
df_csv = pd.read_csv('data/MA_10k_albums.csv')
df_csv

Unnamed: 0,artist,album,MA_score
0,Slayer,Reign in Blood,36.01
1,Metallica,Kill 'Em All,33.39
2,Hades Archer,Penis Metal,32.67
3,Iron Maiden,Iron Maiden,32.38
4,Metallica,Master of Puppets,31.83
...,...,...,...
9995,Iron Maiden,Live at the Rainbow,1.92
9996,Jorn,Worldchanger,1.92
9997,Juggernaut,Trouble Within,1.92
9998,Lacrimas Profundere,Memorandum,1.92


## Extract Album information from LastFM and MusicBrainz

Now we use the LastFM and MusicBrainz APIs to collect more information about the respective albums.
For demonstration, let's preprocess the albums at positions 76 to 100.

*Note: Running this on the full list of 10,000 albums would take a long time, because the APIs throttle clients that send too many requests in a short period.*
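
The cell that follows picks these rows by chaining `head` and `tail`. As a minimal sketch of why that yields positions 76 to 100, here is the same chain applied to a made-up DataFrame; the `position` column and the 200-row size are assumptions for illustration only and are not part of the real CSV.

```python
import pandas as pd

HEAD = 100  # keep the first 100 rows ...
TAIL = 25   # ... then keep the last 25 of those

# Hypothetical stand-in for df_csv: row label 0 holds chart position 1, and so on
demo_df = pd.DataFrame({'position': range(1, 201)})

subset = demo_df.head(HEAD).tail(TAIL)

# Row labels run from HEAD - TAIL to HEAD - 1; chart positions are one higher
print(subset.index.min(), subset.index.max())
print(subset['position'].min(), subset['position'].max())
```

The row labels are zero-based, so the selected labels start at `HEAD - TAIL` while the chart positions start one higher.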

In [4]:
lastfm = dqf.LastFM()

HEAD = 100  # last chart position to include
TAIL = 25   # number of albums to take, counting back from HEAD
FIELDS = ['artist', 'name', 'release-date', 'listeners', 'playcount', 'tags', 'mbid', 'url', 'image']

# Copy the slice so that adding a column does not raise a SettingWithCopyWarning
subset = df_csv.head(HEAD).tail(TAIL).copy()
subset['lastfm_info'] = subset.progress_apply(lambda row: lastfm.get_album_info(artist=row['artist'], album=row['album'], fields=FIELDS), axis=1)

# Expand the per-album info into one column per field
result_df = subset['lastfm_info'].apply(pd.Series)
# Keep the original Metal Archives values next to the LastFM data
result_df['MA_score'] = subset['MA_score']
result_df['MA_artist'] = subset['artist']
result_df['MA_album'] = subset['album']

result_df = result_df.rename(columns={'name': 'album'})
result_df = result_df.set_index(['artist', 'album'])

  0%|          | 0/25 [00:00<?, ?it/s]

Response code 503. Waiting for 4 seconds.
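
The message above shows a pause after a 503 response. How the package handles this internally is not shown in this notebook, so the following is only a rough sketch of one common approach, retrying with a doubling delay. The function `call_with_backoff`, its parameters, and the `fetch` callable are hypothetical names introduced for illustration, not part of `metalhistory`.

```python
import time


def call_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() again after each 503 response, doubling the wait each time.

    fetch is a hypothetical zero-argument callable returning (status_code, payload).
    """
    delay = base_delay
    for _ in range(max_retries):
        status, payload = fetch()
        if status != 503:
            return status, payload
        print(f"Response code {status}. Waiting for {delay:g} seconds.")
        sleep(delay)
        delay *= 2
    # One last attempt after the final wait; the caller sees whatever comes back
    return fetch()


# Usage with canned responses instead of a real API, recording waits instead of sleeping
responses = iter([(503, None), (503, None), (200, {'name': 'Leprosy'})])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
```

Passing the `sleep` function as a parameter keeps the sketch testable without real delays; in actual use the default `time.sleep` would apply.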


## Preview the preprocessed data

Let's display some of the information we gathered about the albums at positions 76 to 100 in the ranking of the top heavy metal albums in history.
We will take a look at the fields:

|__artist__|__album__|__release-date__|__listeners__|__playcount__|__tags__|__ignored tags__|

In [5]:
SHOW_FIELDS = ['release-date', 'listeners', 'playcount', 'tags', 'ignored tags']
result_df[SHOW_FIELDS]

artist,album,release-date,listeners,playcount,tags,ignored tags
Dimmu Borgir,Death Cult Armageddon,2003-09-08,208103.0,4711508.0,"[symphonic black metal, black metal]","[albums i own, melodic black metal, metal]"
,,,,,,
Iron Maiden,The X Factor,1995-10-02,167862.0,3235619.0,[heavy metal],"[albums i own, metal, 1995, nwobhm]"
Suffocation,Effigy of the Forgotten,1991-10-22,61551.0,1211292.0,"[death metal, technical death metal]","[brutal death metal, old school death metal, 1..."
Testament,The New Order,1988-05-05,200357.0,2041217.0,"[thrash metal, heavy metal]","[albums i own, 1988, metal]"
Death,Leprosy,1988-11-16,115061.0,2484759.0,[death metal],"[albums i own, old school death metal, 1988, m..."
Opeth,Still Life,1999-10-18,208677.0,4757870.0,"[progressive metal, progressive death metal, d...","[albums i own, concept album]"
Slayer,Hell Awaits,1985-09-01,171373.0,2729439.0,"[thrash metal, speed metal]","[1985, albums i own, metal]"
Gorguts,Obscura,1998,38129.0,1080301.0,"[technical death metal, death metal, avant-gar...",[]
Blind Guardian,A Night at the Opera,2002-03-05,95678.0,2729619.0,"[power metal, symphonic metal, progressive metal]","[epic metal, albums i own]"


## Export
We can now export the preprocessed dataframe to a CSV file.

In [6]:
start = HEAD - TAIL
stop = HEAD
# Sort by listener count and name the file after the processed range
result_df.sort_values(by='listeners', ascending=False).to_csv(f'./data/proc_MA_{start}-{stop}_albums.csv')
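
One thing to keep in mind when loading such a file again: CSV stores list-valued columns such as `tags` as plain strings. The following is a minimal sketch of a round trip, assuming a small stand-in dataframe (values copied from the preview above) and an in-memory buffer in place of the real file; `demo`, `buffer`, and `restored` are names made up for this example.

```python
import ast
import io

import pandas as pd

# Tiny stand-in for result_df, indexed by (artist, album) like the real one
demo = pd.DataFrame(
    {
        'artist': ['Iron Maiden', 'Death'],
        'album': ['The X Factor', 'Leprosy'],
        'listeners': [167862.0, 115061.0],
        'tags': [['heavy metal'], ['death metal']],
    }
).set_index(['artist', 'album'])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.sort_values(by='listeners', ascending=False).to_csv(buffer)
buffer.seek(0)

# Restore the two-level index and turn the stringified lists back into lists
restored = pd.read_csv(
    buffer,
    index_col=['artist', 'album'],
    converters={'tags': ast.literal_eval},
)
print(restored)
```

Without the `converters` argument, the `tags` column would come back as strings like `"['heavy metal']"` rather than Python lists.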

# Next Steps
Now that we have exported our data to a CSV file, we can use our visualization functions to further explore the history of heavy metal. Follow the link below:


[1-visualizations.ipynb](1-visualizations.ipynb)