# Matching Compositions to Recordings

In [1]:
import pickle
import re

import numpy as np
import pandas as pd

### Bringing in Track + Composition Tables

In [2]:
track_list = pd.read_csv('../data/main_wfeats.csv', index_col=0)
comp_artists = pd.read_csv('../data/comp_artists.csv', index_col=0)
compositions = pd.read_csv('../data/compositions.csv', index_col=0)
artist_comp_lookup = pd.read_csv('../data/artist_comp_lookup.csv', index_col=0)
comp_alt_titles = pd.read_csv('../data/comp_alt_titles.csv', index_col=0)

In [3]:
compositions.head()

Unnamed: 0,CID,AID,Title
0,0,360318916,FOR THA LOVE OF MONEY
1,1,530659306,WE THE PEOPLE
2,2,334030418,CELERY-TIME
3,3,442081954,NEUTRON BOMB
4,4,230055482,WILL THE CIRCLE BE UNBROKEN


In [4]:
artist_comp_lookup.head()

Unnamed: 0,CID,PID
0,0,0
1,1933,0
2,17624,0
3,17630,0
4,17633,0


In [5]:
comp_artists.head()

Unnamed: 0,Performer Name,PID
0,BONE,0
1,BONE THUGS N HARMONY,1
2,BONE THUGS N HARMONY FEAT. EAZY-E,2
3,BONE THUGS-N-HARMONY,3
4,BONE THUGS-N-HARMONY (EDITED),4


In [6]:
comp_alt_titles.head()

Unnamed: 0,alt-title,CID
1,FOE THA LOVE OF $,0
2,FOE THA LOVE OF $,0
3,FOE THA LOVE OF $ (FEAT EASY-E),0
4,FOE THA LOVE OF $ (FEAT. EAZY-E),0
5,FOE THA LOVE OF $ [EXPLICIT],0


#### Re-Doing Compositions Table, So That All Titles (Alt + Regular) Are Within the Same Column

I noticed that many songs that need to be matched can only be matched with a particular default title OR alt-title. Populating them within the same field will allow me to do a simple pandas merge to optimize track/composition matches.

In [7]:
comp_titles = compositions[['CID', 'Title']]
compositions.drop('Title', 1, inplace=True)

##### Merging `comp_titles` + `comp_alt_titles`

In [8]:
comp_alt_titles.rename(columns={'alt-title':'Title'}, inplace=True)

In [9]:
all_comp_titles = pd.concat([comp_titles, comp_alt_titles], axis=0, ignore_index=True, sort=False)

In [10]:
all_comp_titles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423900 entries, 0 to 423899
Data columns (total 2 columns):
CID      423900 non-null int64
Title    422710 non-null object
dtypes: int64(1), object(1)
memory usage: 6.5+ MB


In [11]:
all_comp_titles.sort_values('CID', 0, inplace=True)
all_comp_titles.dropna(axis=0, inplace=True)

In [12]:
all_comp_titles = pd.merge(compositions, all_comp_titles, on='CID')

#### Merging Composition Tables

In [13]:
full_comps = pd.merge(artist_comp_lookup, all_comp_titles, on='CID')
full_comps = pd.merge(full_comps, comp_artists, on='PID')

In [14]:
full_comps.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2500311 entries, 0 to 2500310
Data columns (total 5 columns):
CID               int64
PID               int64
AID               int64
Title             object
Performer Name    object
dtypes: int64(3), object(2)
memory usage: 114.5+ MB


In [15]:
full_comps.head()

Unnamed: 0,CID,PID,AID,Title,Performer Name
0,0,0,360318916,FOR THA LOVE OF MONEY,BONE
1,0,0,360318916,FOE THA LOVE OF $ (FEAT. EAZY-E),BONE
2,0,0,360318916,FOE THA LOVE OF $ [EXPLICIT],BONE
3,0,0,360318916,FOE THA LOVE OF MONEY,BONE
4,0,0,360318916,FOE THE LOVE OF MONEY,BONE


#### Examining Track Table

In [16]:
track_list.head()

Unnamed: 0,song_id,album_release_date,artist_id,artist_name,duration_ms,explicit,linked_album,song_title,danceability,energy,...,pv_dim_3,pv_dim_4,pv_dim_5,pv_dim_6,pv_dim_7,pv_dim_8,pv_dim_9,pv_dim_10,pv_dim_11,pv_dim_12
0,6SluaPiV04KOaRTOIScoff,1995-10-13,6UE7nl9mha6s8z0wFQFIZ2,Robyn,229226.0,False,Robyn Is Here,Show Me Love - Radio Version,0.546,0.643,...,0.231588,0.227392,0.365724,0.220462,0.367808,0.267055,0.344281,0.349016,0.323426,0.480299
1,5qEVq3ZEGr0Got441lueWS,2018-08-10,6S58b0fr8TkWrEHOH4tRVu,Switchfoot,247240.0,False,You Found Me (Unbroken: Path To Redemption),You Found Me (Unbroken: Path To Redemption),0.603,0.802,...,0.384941,0.397085,0.465443,0.237421,0.359981,0.209631,0.283483,0.188632,0.212271,0.49047
2,5kqIPrATaCc2LqxVWzQGbk,2016-04-01,25u4wHJWxCA9vO0CzxAbK7,Lukas Graham,237300.0,False,Lukas Graham,7 Years,0.765,0.473,...,0.341671,0.321183,0.195459,0.330539,0.175221,0.328568,0.153059,0.221073,0.444818,0.203276
3,3aVyHFxRkf8lSjhWdJ68AW,2013-01-01,0C0XlULifJtAgn6ZNCW2eu,The Killers,262000.0,False,Direct Hits,Just Another Girl,0.547,0.779,...,0.229995,0.264792,0.180531,0.281061,0.355194,0.189039,0.256742,0.193406,0.25314,0.308046
4,0zIyxS6QxZogHOpGkI6IZH,2018-09-07,0le01dl1WllSHhjEXRl4in,Tamia,236545.0,False,Passion Like Fire,Deeper,0.438,0.288,...,0.233717,0.128174,0.32137,0.20037,0.391387,0.132925,0.265942,0.537358,0.158429,0.26679


#### Standardizing Composition and Track Tables

In [17]:
full_comps['Title_n'] = full_comps['Title'].apply(lambda x: x.lower())
full_comps['Performer_n'] = full_comps['Performer Name'].apply(lambda x: str(x).lower())

track_list['artist_name_n'] = track_list['artist_name'].apply(lambda x: str(x).lower()).apply(lambda x: str(x).strip("''/*"))
track_list['song_title_n'] = track_list['song_title'].apply(lambda x: str(x).lower()).apply(lambda x: re.sub(r'(\(feat.*)','', x))

### Combining Tables

In [18]:
lol_test = pd.merge(track_list, full_comps, how='left', left_on=['artist_name_n', 'song_title_n'],
                    right_on=['Performer_n', 'Title_n'])

In [19]:
len(lol_test[lol_test['CID'].notnull()])

8673

Not bad for literally doing nothing but removing caps, and 'featuring' language. Let's see what other formatting issues I can minimize

In [20]:
track_list.sort_values('artist_name_n', axis=0, inplace=True)

#### Artists, Songs -w- Special Marks (special characters, "feat", other weird stuff)

##### Spotify Artists

In [21]:
# Parenthesis
track_list['artist_name_n'][track_list['artist_name_n'].str.contains("\(")]

6362     prodigy (of mobb deep) feat. kurupt, jayo felo...
7055         trillville (featuring lil' scrappy & lil jon)
17294        willie nelson & roger miller (with ray price)
22615    young, wild & free (snoop dogg, wiz khalifa & ...
Name: artist_name_n, dtype: object

In [22]:
# Brackets
track_list['artist_name_n'][track_list['artist_name_n'].str.contains("\[")]

22615    young, wild & free (snoop dogg, wiz khalifa & ...
Name: artist_name_n, dtype: object

In [23]:
# Quotation Marks
track_list['artist_name_n'][track_list['artist_name_n'].str.contains("\"")].value_counts()

"weird al" yankovic        10
johnny "guitar" watson     10
evelyn "champagne" king    10
héctor "el father"          1
Name: artist_name_n, dtype: int64

In [24]:
# presence of "feat"
track_list['artist_name_n'][track_list['artist_name_n'].str.contains("feat\.")]

13847    john p. kee and new life feat. james fortune, ...
6362     prodigy (of mobb deep) feat. kurupt, jayo felo...
Name: artist_name_n, dtype: object

##### Spotify Songs

In [25]:
# Parenthesis
len(track_list['song_title_n'][track_list['song_title_n'].str.contains("\(")])

1366

In [26]:
# Brackets
len(track_list['song_title_n'][track_list['song_title_n'].str.contains("\[")])

82

In [27]:
# Quotation Marks
len(track_list['song_title_n'][track_list['song_title_n'].str.contains("\"")])

242

In [28]:
# Hyphens connecting song versions to title
len(track_list['song_title_n'][track_list['song_title_n'].str.contains(" - ")])

2672

In [29]:
# presence of "feat."
len(track_list['song_title_n'][track_list['song_title_n'].str.contains("feat")])

122

In [30]:
# ampersand
len(track_list['song_title_n'][track_list['song_title_n'].str.contains("&")])

231

##### ASCAP Artists

In [31]:
# Parenthesis
len(full_comps['Performer_n'][full_comps['Performer_n'].str.contains("\(")])

9183

In [32]:
# Brackets
full_comps['Performer_n'][full_comps['Performer_n'].str.contains("\[")]

1294358    various / [kate smith w/ jack smith & hi
1294359    various / [kate smith w/ jack smith & hi
1294360    various / [kate smith w/ jack smith & hi
1294361    various / [kate smith w/ jack smith & hi
1294362    various / [kate smith w/ jack smith & hi
1294363    various / [kate smith w/ jack smith & hi
1554379    various / [peabo bryson & roberta flack]
1554380    various / [peabo bryson & roberta flack]
1554381    various / [peabo bryson & roberta flack]
1554382    various / [peabo bryson & roberta flack]
1554383    various / [peabo bryson & roberta flack]
1554384    various / [peabo bryson & roberta flack]
1554385    various / [peabo bryson & roberta flack]
Name: Performer_n, dtype: object

In [33]:
# Quotation Marks
len(full_comps['Performer_n'][full_comps['Performer_n'].str.contains("\"")])

839

In [34]:
# Ampersand
len(full_comps['Performer_n'][full_comps['Performer_n'].str.contains("&")])

54251

##### ASCAP Songs

In [35]:
# Parenthesis
len(full_comps['Title_n'][full_comps['Title_n'].str.contains("\(")])

513731

In [36]:
# Quotation Marks
len(full_comps['Title_n'][full_comps['Title_n'].str.contains("\"")])

47583

In [37]:
# Ampersand
len(full_comps['Title_n'][full_comps['Title_n'].str.contains("&")])

22876

In [38]:
# presence of "feat"
len(full_comps['Title_n'][full_comps['Title_n'].str.contains("feat")])

32425

In [39]:
# Songs -w- Hyphens delineating special versions
len(full_comps['Title_n'][full_comps['Title_n'].str.contains(" - ")])

31513

### Additional Cleaning

#### Spotify Artists

In [40]:
# removing quotation marks
track_list['artist_name_n'] = track_list['artist_name_n'].apply(lambda x: re.sub(r"\"","",x))

# removing parenthesis
track_list['artist_name_n'] = track_list['artist_name_n'].apply(lambda x: re.sub(r'(\s\(.*\))', "", x))

# removing feat. artists
track_list['artist_name_n'] = track_list['artist_name_n'].apply(lambda x: re.sub(r'( feat\..*)', "", x))

#### Spotify Songs

In [41]:
# removing quotation marks
track_list['song_title_n'] = track_list['song_title_n'].apply(lambda x: re.sub(r"\"","", x))
track_list['song_title_n'] = track_list['song_title_n'].apply(lambda x: re.sub(r"\'","", x))

# removing brackets
track_list['song_title_n'] = track_list['song_title_n'].apply(lambda x: re.sub(r' \[.*',"", x))

# removing parenthesis
track_list['song_title_n'] = track_list['song_title_n'].apply(lambda x: re.sub(r'(\s\(.*\))', "", x))

# removing feat. artists
track_list['song_title_n'] = track_list['song_title_n'].apply(lambda x: re.sub(r'( feat\..*)', "", x))

# removing hyphens
track_list['song_title_n'] = track_list['song_title_n'].apply(lambda x: re.sub(r' -.*', "", x))

# removing feat
track_list['song_title_n'] = track_list['song_title_n'].apply(lambda x: re.sub(r'( feat\..*)', "", x))

In [42]:
# changing 'ampersand' to and 
track_list['song_title_n'] = track_list['song_title_n'].apply(lambda x: re.sub(r'', "", x))

#### ASCAP Artists

In [43]:
# removing quotation marks
full_comps['Performer_n'] = full_comps['Performer_n'].apply(lambda x: re.sub(r"\"","",x))

# removing parenthesis
full_comps['Performer_n'] = full_comps['Performer_n'].apply(lambda x: re.sub(r'(\s\(.*\))', "", x))

# removing feat. artists
full_comps['Performer_n'] = full_comps['Performer_n'].apply(lambda x: re.sub(r'( feat\..*)', "", x))

#### ASCAP Songs

In [44]:
# removing quotation marks
full_comps['Title_n'] = full_comps['Title_n'].apply(lambda x: re.sub(r"\"","", x))
full_comps['Title_n'] = full_comps['Title_n'].apply(lambda x: re.sub(r"\'","", x))

# removing brackets
full_comps['Title_n'] = full_comps['Title_n'].apply(lambda x: re.sub(r' \[.*',"", x))

# removing parenthesis
full_comps['Title_n'] = full_comps['Title_n'].apply(lambda x: re.sub(r'(\s\(.*\))', "", x))

# removing feat. artists
full_comps['Title_n'] = full_comps['Title_n'].apply(lambda x: re.sub(r'( feat\..*)', "", x))

# removing hyphens
full_comps['Title_n'] = full_comps['Title_n'].apply(lambda x: re.sub(r' -.*', "", x))

# removing feat
full_comps['Title_n'] = full_comps['Title_n'].apply(lambda x: re.sub(r'( feat\..*)', "", x))

In [66]:
full_comps.head(15)

Unnamed: 0,CID,PID,AID,Title,Performer Name,Title_n,Performer_n
0,0,0,360318916,FOR THA LOVE OF MONEY,BONE,for tha love of money,bone
1,0,0,360318916,FOE THA LOVE OF $ (FEAT. EAZY-E),BONE,foe tha love of $,bone
2,0,0,360318916,FOE THA LOVE OF $ [EXPLICIT],BONE,foe tha love of $,bone
3,0,0,360318916,FOE THA LOVE OF MONEY,BONE,foe tha love of money,bone
4,0,0,360318916,FOE THE LOVE OF MONEY,BONE,foe the love of money,bone
5,0,0,360318916,FOR THE LOVE OF MONEY,BONE,for the love of money,bone
6,0,0,360318916,FOE THA LOVE OF $,BONE,foe tha love of $,bone
7,0,0,360318916,FOE THA LOVE OF $ (FEAT EASY-E),BONE,foe tha love of $,bone
8,0,0,360318916,FOE THA LOVE OF $,BONE,foe tha love of $,bone
9,1933,0,350208616,ETERNAL,BONE,eternal,bone


In [46]:
full_comps.drop_duplicates(['Title_n','Performer_n'], inplace=True)

### Merge Try 2

In [47]:
lol_test_2 = pd.merge(track_list, full_comps, how='left', left_on=['artist_name_n', 'song_title_n'],
                      right_on=['Performer_n', 'Title_n'])

In [48]:
lol_test_2[['Title', 'Performer Name']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22891 entries, 0 to 22890
Data columns (total 2 columns):
Title             9209 non-null object
Performer Name    9209 non-null object
dtypes: object(2)
memory usage: 536.5+ KB


In [59]:
pd.set_option('display.max_rows', 15)
lol_test_2[lol_test_2['Title_n'].notna()].tail(30)

Unnamed: 0,song_id,album_release_date,artist_id,artist_name,duration_ms,explicit,linked_album,song_title,danceability,energy,...,pv_dim_12,artist_name_n,song_title_n,CID,PID,AID,Title,Performer Name,Title_n,Performer_n
22781,54wZPGejEUmJtIEcER8pqG,2006-06-06,23LbwefIODbyGdRbAz3urj,Yung Joc,250013.0,True,New Joc City (Explicit Content U.S. Version),Do Ya Bad,0.752,0.651,...,0.361326,yung joc,do ya bad,138107.0,35477.0,341894311.0,DO YA BAD,YUNG JOC,do ya bad,yung joc
22784,40PPe4g7BcqzxddzeSgtZQ,2009-05-13,23LbwefIODbyGdRbAz3urj,Yung Joc,164157.0,True,Reggae Crunk Shit Vol 8 (Dj Weedim Part),Its Going Down,0.754,0.471,...,0.507336,yung joc,its going down,102614.0,35477.0,371026890.0,IT'S GOING DOWN,YUNG JOC,its going down,yung joc
22787,48sLBplbfMr0db6mvZ5L1t,2006-06-06,23LbwefIODbyGdRbAz3urj,Yung Joc,265986.0,True,New Joc City (Explicit Content U.S. Version),Knock It Out,0.618,0.606,...,0.233565,yung joc,knock it out,138139.0,35477.0,410495111.0,KNOCK IT OUT [EXPLICIT ALBUM VERSION],YUNG JOC,knock it out,yung joc
22792,1rHMg7t3LICgfTl2OsDn46,1971-01-01,08F3Y3SctIlsOEmKd6dnH8,Yusuf / Cat Stevens,102200.0,False,Teaser And The Firecat,The Wind,0.764,0.188,...,0.086359,yusuf / cat stevens,the wind,42487.0,15498.0,530171832.0,THE WIND - FROM SING ORIGINAL MOTION PICTURE,YUSUF / CAT STEVENS,the wind,yusuf / cat stevens
22794,23knlSaE1nRy1PGdF2gJbN,2003-01-01,08F3Y3SctIlsOEmKd6dnH8,Yusuf / Cat Stevens,166533.0,False,The Very Best Of Cat Stevens,"If You Want To Sing Out, Sing Out",0.641,0.294,...,0.278166,yusuf / cat stevens,"if you want to sing out, sing out",52465.0,15498.0,390337652.0,"IF YOU WANT TO SING OUT, SING OUT",YUSUF / CAT STEVENS,"if you want to sing out, sing out",yusuf / cat stevens
22795,19slC7k8bsPOAKDjHYLU2W,1970-11-23,08F3Y3SctIlsOEmKd6dnH8,Yusuf / Cat Stevens,221000.0,False,Tea for the Tillerman (Deluxe Edition),Father And Son,0.495,0.343,...,0.354283,yusuf / cat stevens,father and son,6353.0,15498.0,360102649.0,FATHER AND SON,YUSUF / CAT STEVENS,father and son,yusuf / cat stevens
22796,3BqqF8suAIzW8655yJfcvh,1971-01-01,08F3Y3SctIlsOEmKd6dnH8,Yusuf / Cat Stevens,200000.0,False,Teaser And The Firecat,Morning Has Broken,0.443,0.324,...,0.270379,yusuf / cat stevens,morning has broken,37292.0,15498.0,430394424.0,MORNING HAS BROKEN,YUSUF / CAT STEVENS,morning has broken,yusuf / cat stevens
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22848,2UwCXCPetkLgqDSeTKuQGx,2014-05-30,5Tz4zMiRWqiQVAymWZz99a,Zebra Katz,280645.0,True,Tear The House Up,Tear The House Up,0.935,0.712,...,0.217781,zebra katz,tear the house up,1312.0,3593.0,887767256.0,TEAR THE HOUSE UP,ZEBRA KATZ,tear the house up,zebra katz
22851,3E7h2RNgneP1dbGfd98EZK,2014-05-30,5Tz4zMiRWqiQVAymWZz99a,Zebra Katz,195544.0,True,Tear The House Up,Tear The House Up - Edit,0.879,0.832,...,0.210362,zebra katz,tear the house up,1312.0,3593.0,887767256.0,TEAR THE HOUSE UP,ZEBRA KATZ,tear the house up,zebra katz


In [41]:
track_list.columns

Index(['song_id', 'album_release_date', 'artist_id', 'artist_name',
       'duration_ms', 'explicit', 'linked_album', 'song_title', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'key_changes', 'mean_song_conf', 'mean_loudness', 'mean_mode',
       'mean_mode_conf', 'mean_tempo', 'mean_tempo_conf', 'var_song_conf',
       'var_loudness', 'var_mode', 'var_mode_conf', 'var_tempo',
       'var_tempo_conf', 'tm_dim_1', 'tm_dim_2', 'tm_dim_3', 'tm_dim_4',
       'tm_dim_5', 'tm_dim_6', 'tm_dim_7', 'tm_dim_8', 'tm_dim_9', 'tm_dim_10',
       'tm_dim_11', 'tm_dim_12', 'tv_dim_1', 'tv_dim_2', 'tv_dim_3',
       'tv_dim_4', 'tv_dim_5', 'tv_dim_6', 'tv_dim_7', 'tv_dim_8', 'tv_dim_9',
       'tv_dim_10', 'tv_dim_11', 'tv_dim_12', 'pm_dim_1', 'pm_dim_2',
       'pm_dim_3', 'pm_dim_4', 'pm_dim_5', 'pm_dim_6', 'pm_dim_7', 'pm_dim_8',
       'pm_dim_9', 'pm_dim_10', 'pm_dim_

In [71]:
track_list.drop(['album_release_date','duration_ms', 'explicit', 
                 'linked_album', 'danceability',
                 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
                 'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
                 'key_changes', 'mean_song_conf', 'mean_loudness', 'mean_mode',
                 'mean_mode_conf', 'mean_tempo', 'mean_tempo_conf', 'var_song_conf',
                 'var_loudness', 'var_mode', 'var_mode_conf', 'var_tempo',
                 'var_tempo_conf', 'tm_dim_1', 'tm_dim_2', 'tm_dim_3', 'tm_dim_4',
                 'tm_dim_5', 'tm_dim_6', 'tm_dim_7', 'tm_dim_8', 'tm_dim_9', 'tm_dim_10',
                 'tm_dim_11', 'tm_dim_12', 'tv_dim_1', 'tv_dim_2', 'tv_dim_3',
                 'tv_dim_4', 'tv_dim_5', 'tv_dim_6', 'tv_dim_7', 'tv_dim_8', 'tv_dim_9',
                  'tv_dim_10', 'tv_dim_11', 'tv_dim_12', 'pm_dim_1', 'pm_dim_2',
                  'pm_dim_3', 'pm_dim_4', 'pm_dim_5', 'pm_dim_6', 'pm_dim_7', 'pm_dim_8',
                 'pm_dim_9', 'pm_dim_10', 'pm_dim_11', 'pm_dim_12', 'pv_dim_1',
                 'pv_dim_2', 'pv_dim_3', 'pv_dim_4', 'pv_dim_5', 'pv_dim_6', 'pv_dim_7',
                 'pv_dim_8', 'pv_dim_9', 'pv_dim_10', 'pv_dim_11', 'pv_dim_12'] ,1, inplace=True)

### Merge Try 3 (To Diagnose What I'm Unable to Match)

In [43]:
lol_test_2 = pd.merge(track_list, full_comps, how='left', left_on=['artist_name_n', 'song_title_n'],
                      right_on=['Performer_n', 'Title_n'])

In [44]:
lol_test_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42871 entries, 0 to 42870
Data columns (total 13 columns):
artist_id         42871 non-null object
artist_name       42871 non-null object
song_title        42871 non-null object
artist_name_n     42871 non-null object
song_title_n      42871 non-null object
CID               24608 non-null float64
PID               24608 non-null float64
AID               24608 non-null float64
Title             24608 non-null object
alt-title         24607 non-null object
Performer Name    24608 non-null object
Title_n           24608 non-null object
Performer_n       24608 non-null object
dtypes: float64(3), object(10)
memory usage: 4.6+ MB


In [45]:
lol_test_2.iloc[11124:11134]

Unnamed: 0,artist_id,artist_name,song_title,artist_name_n,song_title_n,CID,PID,AID,Title,alt-title,Performer Name,Title_n,Performer_n
11124,33ScadVnbm2X8kkUqOkC6Z,Don Omar,Dale Don Dale,don omar,dale don dale,,,,,,,,
11125,33ScadVnbm2X8kkUqOkC6Z,Don Omar,Guaya Guaya,don omar,guaya guaya,,,,,,,,
11126,33ScadVnbm2X8kkUqOkC6Z,Don Omar,Pobre Diabla,don omar,pobre diabla,,,,,,,,
11127,4Ti0EKl2PVEms2NRMVGqNe,Don Williams,Tulsa Time,don williams,tulsa time,217585.0,13198.0,500281163.0,TULSA TIME,GLAUBST DU ECHT ANS MATERIAL,DON WILLIAMS,tulsa time,don williams
11128,4Ti0EKl2PVEms2NRMVGqNe,Don Williams,Tulsa Time,don williams,tulsa time,217585.0,13198.0,500281163.0,TULSA TIME,LIVING ON TULSA TIME,DON WILLIAMS,tulsa time,don williams
11129,4Ti0EKl2PVEms2NRMVGqNe,Don Williams,Tulsa Time,don williams,tulsa time,217585.0,13198.0,500281163.0,TULSA TIME,TULSA TIME EXCERPT,DON WILLIAMS,tulsa time,don williams
11130,4Ti0EKl2PVEms2NRMVGqNe,Don Williams,Tulsa Time,don williams,tulsa time,217585.0,13198.0,500281163.0,TULSA TIME,TULSA TIME IWF,DON WILLIAMS,tulsa time,don williams
11131,4Ti0EKl2PVEms2NRMVGqNe,Don Williams,Tulsa Time,don williams,tulsa time,217585.0,13198.0,500281163.0,TULSA TIME,TULSA TIME: GLAUBST DU,DON WILLIAMS,tulsa time,don williams
11132,4Ti0EKl2PVEms2NRMVGqNe,Don Williams,Tulsa Time,don williams,tulsa time,217585.0,13198.0,500281163.0,TULSA TIME,TULSA TIME:HIMMEL IST DAS,DON WILLIAMS,tulsa time,don williams
11133,4Ti0EKl2PVEms2NRMVGqNe,Don Williams,Tulsa Time,don williams,tulsa time,217585.0,13198.0,500281163.0,TULSA TIME,TULSA TIME:LEBEN SO WIE I,DON WILLIAMS,tulsa time,don williams


lemme try and find "Party in the CIA" on my ASCAP song list

In [46]:
pd.set_option('display.max_rows', 45)
full_comps[full_comps['Performer_n'] == 'weird al yankovic'].head()

Unnamed: 0,CID,PID,AID,Title,alt-title,Performer Name,Title_n,Performer_n
1136820,15128,17376,380129566,HEY JUDE,"301039 (""HEY JUDE"")",WEIRD AL YANKOVIC,hey jude,weird al yankovic
1136821,15128,17376,380129566,HEY JUDE,HEY JUDE,WEIRD AL YANKOVIC,hey jude,weird al yankovic
1136822,15128,17376,380129566,HEY JUDE,HEY JUDE 68-10,WEIRD AL YANKOVIC,hey jude,weird al yankovic
1136823,15128,17376,380129566,HEY JUDE,HEY JUDS,WEIRD AL YANKOVIC,hey jude,weird al yankovic
1136824,15128,17376,380129566,HEY JUDE,"M301039 (""HEY JUDE"")",WEIRD AL YANKOVIC,hey jude,weird al yankovic


It's there, but apparently the default title is called "Party at the CIA", fuggin WACK. I wonder if this is one of the bigger problems in merging the two lists - the default titles being different.

I also just learned that Weird Al is a member of BMI, which makes sense that ASCAP wouldn't have all of his comps

Looking at the artist, "10,000 Maniacs" here:

In [47]:
pd.set_option('display.max_rows', 36)
full_comps[full_comps['Performer_n'] == '10,000 maniacs']

Unnamed: 0,CID,PID,AID,Title,alt-title,Performer Name,Title_n,Performer_n
1251353,45420,44160,500249234,THESE DAYS,(THESE DAYS) AKA I'VE BEEN OUT WALKING,"10,000 MANIACS",these days,"10,000 maniacs"
1251354,45420,44160,500249234,THESE DAYS,I VE BEEN OUT WALKING,"10,000 MANIACS",these days,"10,000 maniacs"
1251355,30177,44160,360288799,FEW AND FAR BETWEEN,FEW AND FAR BETWEEN MERCHANT,"10,000 MANIACS",few and far between,"10,000 maniacs"
1251356,209910,44160,310503327,A ROOM FOR EVERYTHING,A ROOM FOR EVERYHTING,"10,000 MANIACS",a room for everything,"10,000 maniacs"
1251357,209910,44160,310503327,A ROOM FOR EVERYTHING,A ROOM FOR EVERYTHING,"10,000 MANIACS",a room for everything,"10,000 maniacs"
1251358,209910,44160,310503327,A ROOM FOR EVERYTHING,ROOM FOR EVERYTHING,"10,000 MANIACS",a room for everything,"10,000 maniacs"
1251359,209911,44160,310259753,AMONG THE AMERICANS,AMONG THE MARICANS,"10,000 MANIACS",among the americans,"10,000 maniacs"
1251360,209914,44160,320327339,BACK O THE MOON,BACK OF THE MOON,"10,000 MANIACS",back o the moon,"10,000 maniacs"
1251361,209919,44160,330402247,COLONIAL WING THE,COLONIAL WING,"10,000 MANIACS",colonial wing the,"10,000 maniacs"
1251362,209921,44160,340282288,DON T TALK,DONT TALK,"10,000 MANIACS",don t talk,"10,000 maniacs"


In [48]:
lol_test_2[lol_test_2['artist_name_n'] == '10,000 maniacs']

Unnamed: 0,artist_id,artist_name,song_title,artist_name_n,song_title_n,CID,PID,AID,Title,alt-title,Performer Name,Title_n,Performer_n
20,0MBIKH9DjtBkv8O3nS6szj,"10,000 Maniacs",More Than This,"10,000 maniacs",more than this,,,,,,,,
21,0MBIKH9DjtBkv8O3nS6szj,"10,000 Maniacs",Candy Everybody Wants,"10,000 maniacs",candy everybody wants,,,,,,,,
22,0MBIKH9DjtBkv8O3nS6szj,"10,000 Maniacs",These Are Days [MTV Unplugged Version],"10,000 maniacs",these are days,,,,,,,,
23,0MBIKH9DjtBkv8O3nS6szj,"10,000 Maniacs",To Sir With Love,"10,000 maniacs",to sir with love,,,,,,,,
24,0MBIKH9DjtBkv8O3nS6szj,"10,000 Maniacs",These Are Days,"10,000 maniacs",these are days,,,,,,,,
25,0MBIKH9DjtBkv8O3nS6szj,"10,000 Maniacs",Trouble Me,"10,000 maniacs",trouble me,,,,,,,,
26,0MBIKH9DjtBkv8O3nS6szj,"10,000 Maniacs",Because The Night [MTV Unplugged Version],"10,000 maniacs",because the night,,,,,,,,
27,0MBIKH9DjtBkv8O3nS6szj,"10,000 Maniacs",Like The Weather,"10,000 maniacs",like the weather,,,,,,,,
28,0MBIKH9DjtBkv8O3nS6szj,"10,000 Maniacs",Hey Jack Kerouac,"10,000 maniacs",hey jack kerouac,209929.0,44160.0,380284942.0,HEY JACK KEROUAC,HEJ JACK KEROUAC,"10,000 MANIACS",hey jack kerouac,"10,000 maniacs"
29,0MBIKH9DjtBkv8O3nS6szj,"10,000 Maniacs",Hey Jack Kerouac,"10,000 maniacs",hey jack kerouac,209929.0,44160.0,380284942.0,HEY JACK KEROUAC,HEY JACK KEROUAC,"10,000 MANIACS",hey jack kerouac,"10,000 maniacs"


After doing some investigation online, it looks like "10,000" maniacs also is registered like "10 000 MANIACS" on ASCAP, and ASCAP did not show my scraper that result based on the search term that I used

### New Idea

I could possibly merge the old ASCAP works list (~2gb), and see if I can get simple artist+song matches agains the spotify list. If that's more successful,..actually, there's no ARTIST NAME on that catalog, which would make that task useless. Nonetheless, I'm going to download a new version of their catalog to see if they might've updated it.

**Update (12/11/18)**: They haven't updated the file with artist names

In [49]:
%whos DataFrame

Variable             Type         Data/Info
-------------------------------------------
artist_comp_lookup   DataFrame               CID     PID\n0<...>[570810 rows x 2 columns]
comp_alt_titles      DataFrame                             <...>[159725 rows x 2 columns]
comp_artists         DataFrame                             <...>[132614 rows x 2 columns]
compositions         DataFrame               CID        AID<...>[264175 rows x 3 columns]
full_comps           DataFrame                CID     PID  <...>1930708 rows x 8 columns]
lol_test             DataFrame                          son<...>[35152 rows x 91 columns]
lol_test_2           DataFrame                        artis<...>[42871 rows x 13 columns]
track_list           DataFrame                        artis<...>n[22891 rows x 5 columns]
