# Data Preprocessing

In this notebook, we demonstrate our data query package, implemented in `metalhistory/data_query_api`.
The package handles the API calls to LastFM and MusicBrainz, from which we obtain information about the albums we're interested in.
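
To make the idea of an API call concrete, here is a minimal sketch of the kind of request such a wrapper might issue for the LastFM `artist.getInfo` method. The helpers `build_artist_info_url` and `pick_fields` are hypothetical and are not taken from the package. The payload shape is an assumption written by hand, and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

# Root URL of the LastFM web API
API_ROOT = "https://ws.audioscrobbler.com/2.0/"


def build_artist_info_url(artist, api_key, autocorrect="1"):
    """Build an artist.getInfo request URL (hypothetical helper, not the package's code)."""
    params = {
        "method": "artist.getinfo",
        "artist": artist,
        "autocorrect": autocorrect,  # let LastFM correct misspelled names
        "api_key": api_key,
        "format": "json",
    }
    return API_ROOT + "?" + urlencode(params)


def pick_fields(payload, fields):
    """Keep only the requested top-level fields of the 'artist' object."""
    artist = payload.get("artist", {})
    return {field: artist.get(field) for field in fields}


# Hand-written payload in the assumed response shape (not a real API response)
sample_payload = {
    "artist": {
        "name": "Example Band",
        "url": "https://www.last.fm/music/Example+Band",
        "stats": {"listeners": "100", "playcount": "2500"},
    }
}

url = build_artist_info_url("Znöwhite", api_key="YOUR_API_KEY")
info = pick_fields(sample_payload, ["name", "stats"])
print(url)
print(info)
```

Building the URL separately from sending it keeps the sketch runnable without network access or a real API key.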


Start with the imports...


In [1]:
import requests
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import metalhistory.data_query_functions as dqf

In [2]:
# Use the pandas extension of tqdm for pretty progress bars
tqdm.pandas()

## Read the band list
Read in a CSV file that has the following structure:

|__Band__|__last.fm__|

where 'Band' refers to an artist's or band's name.

In [3]:
df_csv = pd.read_csv('data/Bands.csv')
df_csv

Unnamed: 0,Band,last.fm
0,311,
1,1349,
2,Beyond the Black,
3,!T.O.O.H.!,
4,(həd) p.e.,
...,...,...
2464,Zhrine,
2465,Znöwhite,
2466,Zonaria,
2467,Zyklon,


## Extract artist information from LastFM

In [6]:
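lastfm = dqf.LastFM()

# Fields to keep from each LastFM artist record
FIELDS = ['name', 'stats']

# Work on a copy so that df_csv keeps only the columns read from the file
result_df = df_csv.copy()
# Query LastFM once per band; each result is stored in 'lastfm_info'
result_df['lastfm_info'] = result_df.progress_apply(lambda row: lastfm.get_artist_info(artist=row['Band'], autocorrect='1', fields=FIELDS), axis=1)

# Expand the returned records into columns, then expand the nested 'stats' entry
lastfm_df = result_df.lastfm_info.apply(pd.Series)
result_df = pd.concat([result_df, lastfm_df['name']], axis=1)
lastfm_df_stats = lastfm_df.stats.apply(pd.Series)
result_df = pd.concat([result_df, lastfm_df_stats], axis=1)

# result_df = result_df.set_index(['Band'])
# Rename the LastFM name column and drop the helper columns
result_df = result_df.rename({'name': 'lastfm_name'}, axis=1)
result_df = result_df.drop(['last.fm', 'lastfm_info'], axis=1)
result_df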


Unnamed: 0,Band,lastfm_name,0,listeners,playcount
0,311,311,,1090605,22752350
1,1349,1349,,143961,4713270
2,Beyond the Black,Beyond the Black,,41195,1471650
3,!T.O.O.H.!,!T.O.O.H.!,,10738,393662
4,(həd) p.e.,(həd) p.e.,,81382,2760580
...,...,...,...,...,...
2464,Zhrine,ZHRINE,,10823,173940
2465,Znöwhite,Znöwhite,,5145,95460
2466,Zonaria,Zonaria,,49338,1144875
2467,Zyklon,Zyklon,,62048,1370011
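
The empty `0` column in the output above is the kind of artifact that `apply(pd.Series)` can produce when some entries are not dictionaries (for example, a failed lookup). Whether that is the cause here is an assumption. As a sketch of an alternative, the nested records can be flattened explicitly before building the dataframe. The helper `flatten_artist_info`, the `None` entry, and the string-valued counts are illustrative assumptions, and the two filled rows reuse values from the table above.

```python
import pandas as pd

# Illustrative lookup results: two successful records and one failed lookup (None)
responses = [
    {"name": "Zyklon", "stats": {"listeners": "62048", "playcount": "1370011"}},
    None,
    {"name": "Zonaria", "stats": {"listeners": "49338", "playcount": "1144875"}},
]


def flatten_artist_info(info):
    """Turn one nested record into a flat dict; missing data becomes None."""
    info = info if isinstance(info, dict) else {}
    stats = info.get("stats") or {}
    return {
        "lastfm_name": info.get("name"),
        "listeners": stats.get("listeners"),
        "playcount": stats.get("playcount"),
    }


flat_df = pd.DataFrame([flatten_artist_info(r) for r in responses])

# Counts are strings in this sketch; convert them to numbers (missing values become NaN)
count_columns = ["listeners", "playcount"]
flat_df[count_columns] = flat_df[count_columns].apply(pd.to_numeric)
print(flat_df)
```

Because every record is mapped to the same three keys, no stray column appears and failed lookups show up as rows with missing values.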


In [8]:
new_df = result_df
new_df


In [13]:
# Drop columns that contain no data at all (here, the empty '0' column)
new_df = new_df.dropna(axis=1, how='all')

In [14]:
new_df

Unnamed: 0,Band,lastfm_name,listeners,playcount
0,311,311,1090605,22752350
1,1349,1349,143961,4713270
2,Beyond the Black,Beyond the Black,41195,1471650
3,!T.O.O.H.!,!T.O.O.H.!,10738,393662
4,(həd) p.e.,(həd) p.e.,81382,2760580
...,...,...,...,...
2464,Zhrine,ZHRINE,10823,173940
2465,Znöwhite,Znöwhite,5145,95460
2466,Zonaria,Zonaria,49338,1144875
2467,Zyklon,Zyklon,62048,1370011


## Export
We can now export the preprocessed dataframe to a CSV file.

In [15]:
new_df.to_csv('./data/proc_Bands.csv')
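
One detail worth checking when exporting: by default, `to_csv` also writes the dataframe index, which comes back as an extra unnamed column on the next `read_csv`. The frame here already carries an `Unnamed: 0` column, which suggests this has happened once before. The sketch below compares both variants in memory. `round_trip` is a hypothetical helper, and the small frame reuses two rows from the table above.

```python
import io

import pandas as pd

# Small stand-in for the preprocessed frame, using two rows from the table above
frame = pd.DataFrame({
    "Unnamed: 0": [2466, 2467],
    "Band": ["Zonaria", "Zyklon"],
    "lastfm_name": ["Zonaria", "Zyklon"],
    "listeners": [49338, 62048],
    "playcount": [1144875, 1370011],
})


def round_trip(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and read it back (hypothetical helper)."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = round_trip(frame)                  # default: the index is written as a column
without_index = round_trip(frame, index=False)  # only the data columns are written

print(len(frame.columns), len(with_index.columns), len(without_index.columns))
```

If the extra column is not wanted downstream, passing `index=False` to the export call (or `index_col=0` when reading) avoids it.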