<img src="Assets/2.png">

# `Contents:`

- [Goals](#goals)
- [Load Libraries](#load)
- [Feature Engineering](#features)
	- [Last track of the night](#last) 
	- [Has solos?](#solo)
	- [Part of a festival?](#festival)
	- [Album Counts](#album)
	- [Track Counts](#track)
	- [Geo-Location](#geo)
- [Finalise the columns and export the CSV](#final)        

<a id="goals"></a>
# `Goals`
---

Ok, great - so now we have all of our primary data (live performances scraped from Metallica.com) and all of our secondary data (all the album information from Metalstorm.net). The next step is to think about all the extra features we want to create and then finalise the database. 

<a id="load"></a>
# `Load Libraries`
---

In [917]:
import pandas as pd
import numpy as np
import regex as re
import ast
from tqdm import tqdm_notebook

In [918]:
df = pd.read_csv('./Dirty/Metallica_Data_Dirty.csv')

In [919]:
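#converts any list stored as a string in the csv back into a list
def converter(x):
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError, TypeError):
        #not a parseable literal (e.g. NaN or plain text): leave the value as it is
        return x
    
#lower-cases every track in a set list; missing values (NaN floats) become an empty list
def lower(x):
    temp = []
    if type(x) != float:
        for i in x:
            try:
                temp.append(i.lower())
            except AttributeError:
                temp.append(i)
    return temp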

df['Set'] = df['Set'].apply(converter)
df['Set'] = df['Set'].apply(lower)
df['Encores'] = df['Encores'].apply(converter)
df['Encores'] = df['Encores'].apply(lower)
df['Other_Acts'] = df['Other_Acts'].apply(converter)

In [920]:
#full set list
df['Set_Encore'] = df['Set'] + df['Encores']
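
As a rough illustration of the round trip above, here is a minimal sketch on made-up rows. The helper `parse_setlist` and the sample data are assumptions for illustration only; unlike `converter`, this version turns anything unparseable into an empty list rather than passing it through.

```python
import ast

def parse_setlist(value):
    """Parse a stringified list from the CSV into lower-case track names.

    Missing values (NaN floats) and unparseable strings become an empty list.
    """
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    return [t.lower() if isinstance(t, str) else t for t in parsed]

# Made-up rows: one show with an encore and one show without
raw_rows = [
    {'Set': "['Battery', 'One']", 'Encores': "['Enter Sandman']"},
    {'Set': "['Fuel']", 'Encores': float('nan')},
]

# Main set plus encore gives the full set list for each show
full_sets = [parse_setlist(r['Set']) + parse_setlist(r['Encores']) for r in raw_rows]
print(full_sets)
```

Because a missing encore becomes `[]`, the concatenation never fails on shows that had no encore.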

<a id="features"></a>
# `Feature Engineering`
---

### There are a handful of extra features which will add extra ways of analysing the data:
  
The last song of the night is always a big deal. Let's pull out which tracks are played last in the main set and last in the encore.

    -Last_track_Set
    -Last_track_Encore
    
It might be interesting to see on which tours we got guitar / bass / drum solos, doodles and extended medleys.

    -Has_Guitar_Solo
    -Has_Bass_Solo
    -Has_Drum_Solo
    -Has_Doodle
    -Has_Medley
    

Probably the most useful feature is seeing which albums get played the most live. I'll use the dictionary I created in 1.2_Data_Acquisition_Supporting_Data to cross-reference each set list against the album track lists and code the album data into the live data. I'll also pull out each individual track from the set lists as its own column, holding a per-show count.
    
    -Count of tracks from each album
    -Count of each individual track
  
Some concerts were part of a festival, e.g. Monsters Of Rock @ Donington Park. Where feasible I'll pull out the festival name into its own series.

    -Part of festival?

Using geocoding through the Google Maps API, we can append latitude and longitude data and do some interesting visualisations in Tableau.

    -Latitude
    -Longitude
   

<a id="last"></a>
## `Last track of the night`
---

In [921]:
#returns last element of list
def last_return(x):
    try:
        return x[-1]
    except:
        return np.nan

In [922]:
df['Last_track_Set'] = df['Set'].apply(last_return)

In [923]:
df['Last_track_Encore'] = df['Encores'].apply(last_return)
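
A small sketch of how the last-track lookup behaves on edge cases, assuming set lists are plain Python lists. The helper name `last_track` is illustrative and not part of the notebook.

```python
import math

def last_track(setlist):
    """Return the final entry of a set list, or NaN when it is empty or missing."""
    if isinstance(setlist, list) and setlist:
        return setlist[-1]
    return math.nan

print(last_track(['hardwired', 'one', 'master of puppets']))
print(last_track([]))  # a show with no encore has no last encore track
```

The explicit type check avoids relying on a bare `except`, so an unexpected value type is easier to spot.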

<a id="solo"></a>
## `Has solos?`
---


In [924]:
#functions that tag if set contains...
#...guitar solo
def has_guitar_solo(x):
    tag = 'guitar solo'
    if tag in str(x).lower():
        return True
    else:
        return False

#...bass solo    
def has_bass_solo(x):
    tag = 'bass solo'
    if tag in str(x).lower():
        return True
    else:
        return False

#...drum solo
def has_drum_solo(x):
    tag = 'drum solo'
    if tag in str(x).lower():
        return True
    else:
        return False
    
#...song medley     
def has_medley(x):
    tag = 'medley'
    if tag in str(x).lower():
        return True
    else:
        return False
    
def has_doodle(x):
    tag = 'doodle'
    if tag in str(x).lower():
        return True
    else:
        return False

In [925]:
df['Has_Guitar_Solo'] = df['Set'].apply(has_guitar_solo)
df['Has_Bass_Solo'] = df['Set'].apply(has_bass_solo)
df['Has_Drum_Solo'] = df['Set'].apply(has_drum_solo)
df['Has_Doodle'] = df['Set'].apply(has_doodle)
df['Has_Medley'] = df['Set'].apply(has_medley)

In [926]:
df.Has_Guitar_Solo.value_counts()

False    1744
True      326
Name: Has_Guitar_Solo, dtype: int64

In [927]:
df.Has_Bass_Solo.value_counts()

False    1489
True      581
Name: Has_Bass_Solo, dtype: int64

In [928]:
df.Has_Drum_Solo.value_counts()

False    1958
True      112
Name: Has_Drum_Solo, dtype: int64

In [929]:
df.Has_Doodle.value_counts()

False    1851
True      219
Name: Has_Doodle, dtype: int64

In [930]:
df.Has_Medley.value_counts()

False    1715
True      355
Name: Has_Medley, dtype: int64
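
The five tagging functions differ only in the tag they search for, so one way to cut the repetition is a small factory. This is a sketch under that assumption; `make_tag_checker`, `TAGS` and the example set list are made up for illustration.

```python
def make_tag_checker(tag):
    """Build a function that reports whether `tag` appears anywhere in a set list."""
    def has_tag(setlist):
        # Same substring test as the functions above: match against the list's string form
        return tag in str(setlist).lower()
    return has_tag

# Column name -> tag to search for
TAGS = {
    'Has_Guitar_Solo': 'guitar solo',
    'Has_Bass_Solo': 'bass solo',
    'Has_Drum_Solo': 'drum solo',
    'Has_Doodle': 'doodle',
    'Has_Medley': 'medley',
}
checkers = {column: make_tag_checker(tag) for column, tag in TAGS.items()}

example_set = ['battery', 'bass solo', 'mercyful fate medley']  # made-up set list
flags = {column: check(example_set) for column, check in checkers.items()}
print(flags)
```

Each checker could then be passed to `df['Set'].apply(...)`. Note that the substring test also matches track titles containing the tag, so a set with 'mercyful fate medley' is flagged as having a medley.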

<a id="festival"></a>
## `Part of a festival?`
---


In [931]:
pre = df[['Venue']]
pre.iloc[15:17]

| | Venue |
|---|---|
| 15 | Austin City Limits @ Zilker Park |
| 16 | Austin City Limits @ Zilker Park |


In [932]:
def keep_festival(x):
    if '@' in x:
        return x.split('@')[0].strip()
    else:
        return np.nan

In [933]:
df['Festival'] = df['Venue']

In [934]:
df['Festival'] = df['Festival'].apply(keep_festival)

In [935]:
## let's remove the festival from the venue column now
def remove_festival(x):
    if '@' in x:
        return x.split('@')[1].strip()
    else:
        return x

In [936]:
#apply function to the entire Venue column
df['Venue'] = df['Venue'].apply(remove_festival)

In [937]:
#excellent - looks like it worked
post = df[['Venue','Festival']]
post.iloc[15:17]

| | Venue | Festival |
|---|---|---|
| 15 | Zilker Park | Austin City Limits |
| 16 | Zilker Park | Austin City Limits |
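
The two passes over `Venue` can also be expressed as one helper that returns both parts at once. A minimal sketch, assuming the `Festival @ Venue` pattern shown above; `split_venue` is a hypothetical name.

```python
import math

def split_venue(venue):
    """Split 'Festival @ Venue' into (venue, festival); festival is NaN without an '@'."""
    if isinstance(venue, str) and '@' in venue:
        # partition splits on the first '@' only, so the rest of the string stays intact
        festival, _, place = venue.partition('@')
        return place.strip(), festival.strip()
    return venue, math.nan

print(split_venue('Austin City Limits @ Zilker Park'))  # ('Zilker Park', 'Austin City Limits')
print(split_venue('Save Mart Center'))                  # ('Save Mart Center', nan)
```

The `isinstance` check also means a missing venue (NaN) passes through unchanged instead of raising a `TypeError` on the `'@' in x` test.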


<a id="album"></a>
## `Album Counts`
---

In [938]:
# Let's load in the Metallica dictionary we made earlier
# (allow_pickle=True is needed on newer NumPy versions to load a pickled dict)
met_dict = np.load('metallica_discography.npy', allow_pickle=True).item()

In [939]:
#function that takes in a set list and an album as arguments. Returns the number of songs
#from that album that were played in that set
def album_counter(x,album):
    try:
        album_count=0
        #converts dictionary value (album list) into lower case
        album_list_low = [i.lower() for i in list(met_dict[album])]
        for each in x:
            if each.lower() in album_list_low :
                album_count +=1
        return int(album_count)
    except:
        return x

In [940]:
met_dict.keys()

dict_keys(["Kill 'Em All", 'Ride The Lightning', 'Master Of Puppets', '...And Justice For All', 'Metallica', 'Load', 'Re-Load', 'Garage Inc.', 'St. Anger', 'Death Magnetic', 'Hardwired... To Self-Destruct'])

In [941]:
df["Kill_'Em_All_Count"] = df['Set'].apply(lambda x: album_counter(x,"Kill 'Em All"))
df['Ride_The_Lightning_Count'] = df['Set'].apply(lambda x: album_counter(x,"Ride The Lightning"))
df['Master_Of_Puppets_Count'] = df['Set'].apply(lambda x: album_counter(x,"Master Of Puppets"))
df['And_Justice_For_All_Count'] = df['Set'].apply(lambda x: album_counter(x,"...And Justice For All"))
df['Metallica_Count'] = df['Set'].apply(lambda x: album_counter(x,"Metallica"))
df['Load_Count'] = df['Set'].apply(lambda x: album_counter(x,"Load"))
df['Re_Load_Count'] = df['Set'].apply(lambda x: album_counter(x,"Re-Load"))
df['Garage_Inc_Count'] = df['Set'].apply(lambda x: album_counter(x,"Garage Inc."))
df['St_Anger_Count'] = df['Set'].apply(lambda x: album_counter(x,"St. Anger"))
df['Death_Magnetic_Count'] = df['Set'].apply(lambda x: album_counter(x,"Death Magnetic"))
df['Hardwired_To_Self_Destruct_Count'] = df['Set'].apply(lambda x: album_counter(x,"Hardwired... To Self-Destruct"))
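
The eleven assignments above follow one naming pattern, so the column names could be derived from the album titles instead of typed out. This is a sketch only; `count_column_name` is a hypothetical helper and the regular expression is my assumption about the pattern, checked against the names used above.

```python
import re

def count_column_name(album):
    """Turn an album title into a column name such as 'Re_Load_Count'."""
    # Collapse every run of characters other than letters, digits and apostrophes into '_'
    slug = re.sub(r"[^A-Za-z0-9']+", "_", album).strip("_")
    return f"{slug}_Count"

albums = ["Kill 'Em All", "...And Justice For All", "Re-Load", "St. Anger",
          "Hardwired... To Self-Destruct"]
for album in albums:
    print(count_column_name(album))
```

With that helper, a loop over `met_dict` could assign `df[count_column_name(album)]` using the same `album_counter` call as above.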


In [942]:
#there are four RTL tracks in this gig...'Creeping Death','Ride the lightning','Fade to Black' 
#and 'For Whom the bell tolls'

#there are two tracks from Master of Puppets: 'Welcome Home (Sanitarium)' and 'Master of Puppets'
df.Set[0]

['hardwired',
 'atlas, rise!',
 'seek and destroy',
 'ride the lightning',
 'welcome home (sanitarium)',
 "now that we're dead",
 'creeping death',
 'for whom the bell tolls',
 'fade to black',
 'hit the lights',
 'fuel',
 'moth into flame',
 'sad but true',
 'one',
 'master of puppets']

In [943]:
#we can see the count function has worked
df[['Venue','Set','Ride_The_Lightning_Count','Master_Of_Puppets_Count']].head(1)

| | Venue | Set | Ride_The_Lightning_Count | Master_Of_Puppets_Count |
|---|---|---|---|---|
| 0 | Save Mart Center | [hardwired, atlas, rise!, seek and destroy, ri... | 4 | 2 |


<a id="track"></a>
## `Track Counts`
---

In [944]:
#extracts all tracks into a list of lists
all_tracks = []
for x,y in met_dict.items():
    all_tracks.append(y)

#flattens lists of lists, cleans a record to remove a character and lower cases all tracks
flat_list = [item for sublist in all_tracks for item in sublist]
flat_list[8] = 'seek and destroy'
flat_list[4] = '(anesthesia) - pulling teeth'
flat_list = [i.lower() for i in flat_list]

#and creates a new dataframe with tracks as columns
df_tracks = pd.DataFrame(columns=flat_list)


In [945]:
#concats new track column dataframe onto original data frame on axis 1 
df = pd.concat([df, df_tracks], axis=1, sort=False)

In [946]:
#function that takes in a full set list and a track name as arguments. Returns the number of times
#that track was played in that set
def track_counter(x,track):
    track_count = 0
    trackaslist = []
    trackaslist.append(track)
    try:
        for each in x:
            if each in trackaslist:
                track_count +=1
        return int(track_count)
    except:
        return 0
   
    

In [947]:
for i in flat_list:
    df[i] = df.Set_Encore.apply(lambda x: track_counter(x,i))
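
An alternative to looping over every track is to count each show's set list once and then look up the known tracks. A minimal sketch, assuming set lists are lists of lower-case strings; `track_counts`, `known_tracks` and the sample show are illustrative stand-ins.

```python
from collections import Counter

def track_counts(setlist, tracks):
    """Count how often each known track appears in one show's full set list."""
    played = Counter(setlist) if isinstance(setlist, list) else Counter()
    # Counter returns 0 for tracks that were not played
    return {track: played[track] for track in tracks}

known_tracks = ['one', 'fuel', 'battery']        # stand-in for flat_list
show = ['battery', 'one', 'bass solo', 'one']    # made-up full set list
print(track_counts(show, known_tracks))
```

Entries that are not in the track list (such as 'bass solo') are simply ignored, and a missing set list yields zeros for every track.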

<a id="geo"></a>
## `Geo-Location`
---


In [981]:
unique_locations = df.City_Country.unique()
len(unique_locations)

496

In [976]:
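#Function that returns a location's lat long
#The API key is read from an environment variable instead of being hard-coded
#(the variable name GOOGLE_MAPS_API_KEY is an assumption; use whatever name you export)
import os
APIkey = os.environ['GOOGLE_MAPS_API_KEY']

def latlong(x):
    import googlemaps
    try:
        gmaps = googlemaps.Client(key=APIkey)
        result = gmaps.geocode(x)
        lat = result[0]['geometry']['location']['lat']
        long = result[0]['geometry']['location']['lng']
        return lat, long
    except Exception:
        #no match or a failed request: return a NaN pair so it still splits into Lat and Long
        return np.nan, np.nan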

In [978]:
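lat_long = []
for i in tqdm_notebook(unique_locations):
    try:
        lat_long.append(latlong(i))
    except Exception:
        #keep a NaN pair (not the location string) so every entry splits into Lat and Long
        lat_long.append((np.nan, np.nan))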

In [980]:
len(lat_long)

496

In [982]:
locations = pd.DataFrame({'Location':unique_locations,'Lat_Long':lat_long})

In [983]:
locations[['Lat', 'Long']] = locations['Lat_Long'].apply(pd.Series)
locations.drop(['Lat_Long'],axis=1,inplace=True)

In [984]:
#merging the lat long data on the original data set
df = pd.merge(df, locations, left_index=False, right_index=False, 
                left_on='City_Country', right_on='Location', how = 'left')
df.drop(['Location'],axis=1,inplace = True)

In [986]:
#it's worked
df[['City_Country','Lat','Long']].tail(20)

| | City_Country | Lat | Long |
|---|---|---|---|
| 2050 | San Francisco, California, United States | 37.7749 | -122.419415 |
| 2051 | North Hollywood, California, United States | 34.187 | -118.381256 |
| 2052 | West Hollywood, California, United States | 34.09 | -118.361744 |
| 2053 | Anaheim, California, United States | 33.8366 | -117.914301 |
| 2054 | West Hollywood, California, United States | 34.09 | -118.361744 |
| 2055 | Long Beach, California, United States | 33.7701 | -118.19374 |
| 2056 | West Hollywood, California, United States | 34.09 | -118.361744 |
| 2057 | West Hollywood, California, United States | 34.09 | -118.361744 |
| 2058 | Dana Point, California, United States | 33.4672 | -117.698101 |
| 2059 | Huntington Beach, California, United States | 33.6595 | -117.998803 |
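
The pattern used in this section (geocode each distinct location once, then join the result back onto every show) can be sketched offline with a stand-in geocoder. Everything here is illustrative: `geocode_unique` and `fake_geocode` are hypothetical names, and the stub replaces the real API call so the example runs without a key.

```python
import math

def geocode_unique(locations, geocode):
    """Geocode each distinct location once; failed lookups map to (NaN, NaN)."""
    coords = {}
    for place in dict.fromkeys(locations):  # de-duplicates while keeping order
        try:
            coords[place] = geocode(place)
        except Exception:
            coords[place] = (math.nan, math.nan)
    return coords

def fake_geocode(place):
    """Hypothetical offline stand-in for a real geocoding call."""
    table = {'Anaheim, California, United States': (33.8366, -117.914301)}
    return table[place]  # raises KeyError for unknown places

cities = ['Anaheim, California, United States', 'Nowhere',
          'Anaheim, California, United States']
lookup = geocode_unique(cities, fake_geocode)
print(len(lookup))  # two distinct locations, so two lookups
```

Keeping failures as a NaN pair means every entry has the same shape, which makes the later split into `Lat` and `Long` columns straightforward.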


<a id="final"></a>
# `Finalise the columns and export the CSV`
---


In [997]:
final = df[['Date', 'Venue','Festival','City_Country', 'Lat', 'Long','Tour', 'Set', 'Last_track_Set', 
        'Encores','Last_track_Encore','Encores_Count', 'Set_Length', 'Other_Acts',
        'Has_Guitar_Solo', 'Has_Bass_Solo',
        'Has_Drum_Solo', 'Has_Doodle', 'Has_Medley', "Kill_'Em_All_Count",
        'Ride_The_Lightning_Count', 'Master_Of_Puppets_Count',
        'And_Justice_For_All_Count', 'Metallica_Count', 'Load_Count',
        'Re_Load_Count', 'Garage_Inc_Count', 'St_Anger_Count',
        'Death_Magnetic_Count', 'Hardwired_To_Self_Destruct_Count','hit the lights',
 'the four horsemen',
 'motorbreath',
 'jump in the fire',
 '(anesthesia) - pulling teeth',
 'whiplash',
 'phantom lord',
 'no remorse',
 'seek and destroy',
 'metal militia',
 'fight fire with fire',
 'ride the lightning',
 'for whom the bell tolls',
 'fade to black',
 'trapped under ice',
 'escape',
 'creeping death',
 'the call of ktulu',
 'battery',
 'master of puppets',
 'the thing that should not be',
 'welcome home (sanitarium)',
 'disposable heroes',
 'leper messiah',
 'orion',
 'damage, inc.',
 'blackened',
 '...and justice for all',
 'eye of the beholder',
 'one',
 'the shortest straw',
 'harvester of sorrow',
 'the frayed ends of sanity',
 'to live is to die',
 'dyers eve',
 'enter sandman',
 'sad but true',
 'holier than thou',
 'the unforgiven',
 'wherever i may roam',
 "don't tread on me",
 'through the never',
 'nothing else matters',
 'of wolf and man',
 'the god that failed',
 'my friend of misery',
 'the struggle within',
 "ain't my bitch",
 '2x4',
 'the house jack built',
 'until it sleeps',
 'king nothing',
 'hero of the day',
 'bleeding me',
 'cure',
 'poor twisted me',
 'wasting my hate',
 'mama said',
 'thorn within',
 'ronnie',
 'the outlaw torn',
 'fuel',
 'the memory remains',
 "devil's dance",
 'the unforgiven ii',
 'better than you',
 'slither',
 'carpe diem baby',
 'bad seed',
 'where the wild things are',
 'prince charming',
 "low man's lyric",
 'attitude',
 'fixxxer',
 'free speech for the dumb',
 "it's electric",
 'sabbra cadabra',
 'turn the page',
 'die, die my darling',
 'loverman',
 'mercyful fate medley',
 'astronomy',
 'whiskey in the jar',
 "tuesday's gone",
 'the more i see',
 'helpless',
 'the small hours',
 'the wait',
 'crash course in brain surgery',
 'last caress / green hell',
 'am i evil?',
 'blitzkrieg',
 'breadfan',
 'the prince',
 'stone cold crazy',
 'so what',
 'killing time',
 'overkill',
 'damage case',
 'stone dead forever',
 'too late too late',
 'frantic',
 'st.anger',
 'some kind of monster',
 'dirty window',
 'invisible kid',
 'my world',
 'shoot me again',
 'sweet amber',
 'the unnamed feeling',
 'purify',
 'all within my hands',
 'that was just your life',
 'the end of the line',
 'broken, beat & scarred',
 'the day that never comes',
 'all nightmare long',
 'cyanide',
 'the unforgiven iii',
 'the judas kiss',
 'suicide & redemption',
 'my apocalypse',
 'hardwired',
 'atlas, rise!',
 "now that we're dead",
 'moth into flame',
 'dream no more',
 'halo on fire',
 'confusion',
 'manunkind',
 'here comes revenge',
 'am i savage?',
 'murder one',
 'spit out the bone','URL']]

In [1000]:
final.to_csv('./Clean/Metallica_Data_Clean.csv',index=False)

<img src="Assets/james.gif">