# Matching Compositions to Recordings 2

Date Started: 2/9/19

It's been a minute since I last worked on this project, but I now have a large number of new tracks whose IDs I've pulled from Spotify's API. This notebook looks at those tracks and matches them against my listing of ASCAP compositions to see whether there are enough matches to move forward with an MVP.

In [54]:
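import json
import os
import re
import sys
# sys.path does not expand '~', so resolve the home directory explicitly
sys.path.append(os.path.expanduser('~/dev/cleaning_tools'))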

import numpy as np
import pandas as pd

## Table of Contents

1. [Bringing in Datasets](#1)
2. [Cleaning Datasets](#2)
3. [Merging Datasets](#3)

<a name="1"></a>
## 1. Bringing in Datasets

In [2]:
track_list = pd.read_csv('../data/main_wfeats.csv', index_col=0, usecols=[0,1,4,8])
comp_artists = pd.read_csv('../data/comp_artists.csv', index_col=0)
compositions = pd.read_csv('../data/compositions.csv', index_col=0)
artist_comp_lookup = pd.read_csv('../data/artist_comp_lookup.csv', index_col=0)
comp_alt_titles = pd.read_csv('../data/comp_alt_titles.csv', index_col=0)

### 1a. New Tracks from Last Major Pull

In [3]:
with open('../data/new_tracks_20190103.json', 'r') as f:
    new_tracks = json.load(f)
    
new_tracks = pd.DataFrame.from_dict(new_tracks, orient='index').reset_index()\
                .rename(columns={'index':'song_id','Song Title': 'song_title',
                                 'Artist': 'artist_name'})

### 1b. Merging Old and New Tracklist

There are roughly 7k duplicates (by `song_id`) between the two sets.
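
One way to check that kind of overlap before concatenating is to test membership of one `song_id` column in the other. The sketch below uses two tiny made-up frames (the IDs and titles are placeholders, not rows from the real data) and assumes only that both frames share a `song_id` column:

```python
import pandas as pd

# Toy stand-ins for track_list and new_tracks; values are illustrative only.
old = pd.DataFrame({'song_id': ['a1', 'b2', 'c3'],
                    'song_title': ['one', 'two', 'three']})
new = pd.DataFrame({'song_id': ['c3', 'd4'],
                    'song_title': ['three', 'four']})

# Count IDs in the new pull that already exist in the old list.
n_overlap = new['song_id'].isin(old['song_id']).sum()

# Stack both frames and keep the first row seen for each song_id.
combined = pd.concat([old, new], ignore_index=True)\
             .drop_duplicates(subset='song_id')

print(n_overlap, len(combined))
```

Because `drop_duplicates` keeps the first occurrence by default, rows from the older list win whenever the same `song_id` appears in both.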

In [4]:
track_list.head()

Unnamed: 0,song_id,artist_name,song_title
0,6SluaPiV04KOaRTOIScoff,Robyn,Show Me Love - Radio Version
1,5qEVq3ZEGr0Got441lueWS,Switchfoot,You Found Me (Unbroken: Path To Redemption)
2,5kqIPrATaCc2LqxVWzQGbk,Lukas Graham,7 Years
3,3aVyHFxRkf8lSjhWdJ68AW,The Killers,Just Another Girl
4,0zIyxS6QxZogHOpGkI6IZH,Tamia,Deeper


In [5]:
new_tracks.head()

Unnamed: 0,song_id,song_title,artist_name
0,0007aPK8VmXN4ycL2OcBFa,Bodhisattva - Live,Toto
1,0008G8TW7eiVfwlRRsKlgW,Don`t Go,Stevie B
2,000BqzNd7gRYnK6umzTNZX,You Still Want Me - 2014 Remastered Version,The Kinks
3,000CSIqE1KcjAiZYYWXV18,Under The Sun (Ecclesiastes),Michael Card
4,000G1xMMuwxNHmwVsBdtj1,Will Anything Happen,Blondie


In [8]:
track_list = pd.concat([track_list, new_tracks], ignore_index=True, sort=True)\
                        .drop_duplicates(subset='song_id')

In [11]:
track_list.shape

(659315, 3)

### 1c. Readying Compositions Table

In [12]:
compositions.head()

Unnamed: 0,CID,AID,Title
0,0,360318916,FOR THA LOVE OF MONEY
1,1,530659306,WE THE PEOPLE
2,2,334030418,CELERY-TIME
3,3,442081954,NEUTRON BOMB
4,4,230055482,WILL THE CIRCLE BE UNBROKEN


In [35]:
def merging_comps(comps, alt_titles, ac_lookup, c_artists):
    '''
    Creates master compositions table with a unique record for each song title and
    alternate title. The returned listing should be quite large.
    
    Each of the required arguments is a table of composition-level information 
    originally pulled from ASCAP's website.
    
    Parameters
    ----------
    comps: df, compositions table 
    alt_titles: df, composition alternate titles 
    ac_lookup: df, artist-composition lookup table
    c_artists: df, artist level composition data
    '''
    comp_titles = comps[['CID', 'Title']]
    
    # Stack primary and alternate titles into one long table keyed by CID
    at = alt_titles.rename(columns={'alt-title': 'Title'})
    all_comp_titles = pd.concat([comp_titles, at], axis=0, ignore_index=True, sort=True)\
                                .sort_values('CID').dropna(axis=0)
    print('Titles merged...Now merging in artist names')
   
    # Attach performer IDs (PID) to every title, then the performer names.
    # AID from the compositions table is not carried into the result.
    full_comps = pd.merge(ac_lookup, all_comp_titles, on='CID')
    full_comps = pd.merge(full_comps, c_artists, on='PID')
    
    return full_comps

In [36]:
comp_list = merging_comps(compositions, comp_alt_titles, artist_comp_lookup, comp_artists)

Titles merged...Now merging in artist names


In [37]:
comp_list.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2500311 entries, 0 to 2500310
Data columns (total 4 columns):
CID               int64
PID               int64
Title             object
Performer Name    object
dtypes: int64(2), object(2)
memory usage: 95.4+ MB


In [38]:
comp_list.head()

Unnamed: 0,CID,PID,Title,Performer Name
0,0,0,FOR THA LOVE OF MONEY,BONE
1,0,0,FOE THA LOVE OF $ (FEAT. EAZY-E),BONE
2,0,0,FOE THA LOVE OF $ [EXPLICIT],BONE
3,0,0,FOE THA LOVE OF MONEY,BONE
4,0,0,FOE THE LOVE OF MONEY,BONE
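
To make the shape of that result easier to see, here is a minimal sketch of the same stack-then-join idea on toy tables. The column names (`CID`, `PID`, `Title`, `alt-title`, `Performer Name`) follow the ones used in this notebook; every row is invented for illustration.

```python
import pandas as pd

# Toy tables; the rows are invented, only the column names mirror the notebook.
comps = pd.DataFrame({'CID': [0, 1], 'AID': [10, 11],
                      'Title': ['SONG A', 'SONG B']})
alt_titles = pd.DataFrame({'CID': [0], 'alt-title': ['SONG A (LIVE)']})
ac_lookup = pd.DataFrame({'CID': [0, 1], 'PID': [100, 101]})
c_artists = pd.DataFrame({'PID': [100, 101],
                          'Performer Name': ['ARTIST X', 'ARTIST Y']})

# Stack primary and alternate titles into one long (CID, Title) table.
titles = pd.concat(
    [comps[['CID', 'Title']],
     alt_titles.rename(columns={'alt-title': 'Title'})],
    ignore_index=True)

# Attach performer IDs, then performer names.
full = titles.merge(ac_lookup, on='CID').merge(c_artists, on='PID')
print(full.sort_values(['CID', 'Title']))
```

Each composition ends up with one row per title variant, which is why the real table grows to millions of rows.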


<a name="2"></a>
## 2. Cleaning Datasets

### 2a. Standardizing Composition & Track Tables

In [39]:
comp_list['Performer Name'] = comp_list['Performer Name'].apply(lambda x: str(x).lower())
comp_list['Title'] = comp_list['Title'].apply(lambda x: str(x).lower())

# Lowercase, then strip stray quote, slash, and asterisk characters from both ends
track_list['artist_name'] = track_list['artist_name'].apply(lambda x: str(x).lower().strip("'/*"))
# Lowercase, then drop everything from "(feat" onward
track_list['song_title'] = track_list['song_title'].apply(lambda x: re.sub(r'\(feat.*', '', str(x).lower()))
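
Two side effects of this style of standardizing are worth checking, shown here on an invented three-value series. This is a sketch, not a change to the cell above: `str(x)` turns a missing value into the literal text `'nan'`, and removing `(feat...` leaves the space that preceded the parenthesis.

```python
import pandas as pd

# Invented sample values, including one missing title.
titles = pd.Series(['Show Me Love (feat. Someone)', 'DEEPER', float('nan')])

# str(x) converts the missing value into the text 'nan', which a later
# merge would treat like any other title.
as_text = titles.apply(lambda x: str(x).lower())

# The .str accessor lowercases real strings and leaves missing values missing.
vectorized = titles.str.lower().str.replace(r'\(feat.*', '', regex=True)

# Dropping "(feat. ...)" keeps the space before it; strip() removes that.
stripped = vectorized.str.strip()

print(repr(as_text.iloc[2]), repr(vectorized.iloc[0]), repr(stripped.iloc[0]))
```

Whether a trailing space matters depends on whether the other table's titles carry the same space; an exact-match merge treats `'show me love '` and `'show me love'` as different keys.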

### 2b. Cleaning Tools

Functions I'll later save for general cleaning

In [84]:
# Does not work currently. Perhaps because the object is getting modified within the class?

import re

import numpy as np
import pandas as pd


class Clean:
    def __init__(self):
        # Copy so add_term() does not mutate the module-level cleaning_dict
        self.clean_terms = dict(cleaning_dict)

    
    def add_term(self, term, regex):
        '''
        Adds to the `clean_terms` available within the instance

        Parameters:
        ----------
        term : str, term to add to `clean_terms`
        regex : str, regex implementation to find term
        '''
        if isinstance(term, str) and isinstance(regex, str):
            self.clean_terms[term] = regex
        else:
            print('Cannot add, both values should be str')
            return


    def clean_series(self, series, terms, substitute=''):
        '''
        Iterates through a pandas series of strings and applies regex-based
        substitutions for the requested entries in `clean_terms`
        
        Parameters:
        -----------
        series : pandas series, a series of strings to perform operations 
        on
        terms : list, keys of `clean_terms` whose patterns should be applied
        substitute : str default='', text to substitute for each match

        Returns a new series with the changes; the input series is not
        modified in place, so assign the result back
        '''
        for term in terms:
            pattern = self.clean_terms[term]
            series = series.apply(lambda x: re.sub(pattern, substitute, x))

        return series
        
cleaning_dict = {
    'dbl quote mark': r"\"",
    'single quote mark': r"\'",
    'brackets': r' \[.*',
    'parenthesis': r'(\s\(.*\))',
    'feat artist': r'( feat\..*)',
    'hyphens' : r' -.*',
}
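
To see what these patterns do in practice, the sketch below applies them in order to a few titles taken from the tables shown in this notebook. The helper name `clean_title` is hypothetical and only used for illustration; it relies on nothing beyond the standard `re` module.

```python
import re

# The same six patterns as the cleaning dictionary, applied in the same order.
patterns = [
    r'"',             # double quotes
    r"'",             # single quotes
    r' \[.*',         # " [" and everything after it
    r'(\s\(.*\))',    # whitespace plus a parenthesized group
    r'( feat\..*)',   # " feat." and everything after it
    r' -.*',          # " -" and everything after it
]

def clean_title(title):
    """Apply each pattern in turn, deleting whatever it matches."""
    for pattern in patterns:
        title = re.sub(pattern, '', title)
    return title

samples = [
    'show me love - radio version',
    'you found me (unbroken: path to redemption)',
    'foe tha love of $ [explicit]',
    "don't like.1",
]
for s in samples:
    print(repr(s), '->', repr(clean_title(s)))
```

The hyphen and bracket rules are deliberately aggressive: anything after `" -"` or `" ["` is dropped, so a title whose real name contains a spaced hyphen would be truncated as well.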

### 2c. Cleaning Track List

In [58]:
track_list.head(25)

Unnamed: 0,artist_name,song_id,song_title
0,robyn,6SluaPiV04KOaRTOIScoff,show me love - radio version
1,switchfoot,5qEVq3ZEGr0Got441lueWS,you found me (unbroken: path to redemption)
2,lukas graham,5kqIPrATaCc2LqxVWzQGbk,7 years
3,the killers,3aVyHFxRkf8lSjhWdJ68AW,just another girl
4,tamia,0zIyxS6QxZogHOpGkI6IZH,deeper
5,karla bonoff,7xYDqpnQdqlgxBDm2ySggl,standing right next to me
6,juice newton,3slY9zt6oUOPDaUwRfgqzH,it's a heartache
7,kanye west,12D0n7hKpPcjuUpcbAKjjr,don't like.1
8,randy newman,5e0O7MjhNHq9G67qDFM8nR,"monsters, inc."
9,chamillionaire,3EcmNKUi5OOWXUGOsxlCca,slow loud & bangin


In [95]:
# removing quotation marks
track_list['artist_name'] = track_list['artist_name'].apply(lambda x: re.sub(r"\"","",x))

# removing parentheses
track_list['artist_name'] = track_list['artist_name'].apply(lambda x: re.sub(r'(\s\(.*\))', "", x))

# removing feat. artists
track_list['artist_name'] = track_list['artist_name'].apply(lambda x: re.sub(r'( feat\..*)', "", x))

In [92]:
for k in cleaning_dict:
    track_list['song_title'] = track_list['song_title'].apply(
        lambda x: re.sub(cleaning_dict[k], '', x))

### 2d. Cleaning Composition List

In [98]:
for k in cleaning_dict:
    comp_list['Title'] = comp_list['Title'].apply(
        lambda x: re.sub(cleaning_dict[k], '', x))

In [101]:
# removing quotation marks
comp_list['Performer Name'] = comp_list['Performer Name'].apply(lambda x: re.sub(r"\"","",x))

# removing parentheses
comp_list['Performer Name'] = comp_list['Performer Name'].apply(lambda x: re.sub(r'(\s\(.*\))', "", x))

# removing feat. artists
comp_list['Performer Name'] = comp_list['Performer Name'].apply(lambda x: re.sub(r'( feat\..*)', "", x))

<a name="3"></a>
## 3. Merging Datasets

Merging the datasets using two different strategies:

1. Lowercasing song titles and artist names in both datasets (only step 2a above). Results in the `merge_case_1` dataset.
2. In addition to 1, removing quotation marks, brackets, parentheses, and other miscellaneous formatting marks from both artist names and song titles (all of section 2). Results in the `merge_case_2` dataset.
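
Both strategies rely on the same exact-match left merge on the (artist, title) pair. A minimal sketch of that step on toy rows follows; the column names mirror the notebook's tables, while the IDs and the unmatched row are invented.

```python
import pandas as pd

# Toy rows; only the column names mirror the notebook's tables.
tracks = pd.DataFrame({
    'song_id': ['t1', 't2', 't3'],
    'artist_name': ['robyn', 'tamia', 'unknown artist'],
    'song_title': ['show me love', 'deeper', 'no such song']})
comps = pd.DataFrame({
    'CID': [1, 1, 2],
    'Performer Name': ['robyn', 'robyn', 'tamia'],
    'Title': ['show me love', 'show me love', 'deeper']})

# A left merge keeps every track; unmatched tracks get NaN in CID.
merged = pd.merge(tracks, comps, how='left',
                  left_on=['artist_name', 'song_title'],
                  right_on=['Performer Name', 'Title'])

# Keep matched rows only, then collapse to one row per track.
matched = merged[merged['CID'].notnull()].drop_duplicates(subset='song_id')
print(len(merged), len(matched))
```

The `drop_duplicates` step matters because cleaning can collapse several alternate titles into the same string, so one track may match the same composition more than once.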

In [103]:
test_merge = pd.merge(track_list, comp_list, how='left', left_on=['artist_name', 'song_title'],
                    right_on=['Performer Name', 'Title'])

In [46]:
# Run earlier in the session (note the execution count), on a test_merge built after step 2a only
merge_case_1 = test_merge[test_merge['CID'].notnull()].drop_duplicates(subset='song_id')

In [104]:
# Run on the test_merge rebuilt above, after all of the cleaning in section 2
merge_case_2 = test_merge[test_merge['CID'].notnull()].drop_duplicates(subset='song_id')

In [105]:
len(merge_case_1), len(merge_case_2)

(102025, 159438)
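
Those counts can be put in terms of match rate. The arithmetic below assumes the denominator is the 659,315 de-duplicated tracks reported by `track_list.shape` earlier, which seems reasonable since both merge results are de-duplicated by `song_id`.

```python
# Counts reported in this notebook.
total_tracks = 659315      # track_list rows after de-duplication (assumed denominator)
matched_case_1 = 102025    # lowercase only
matched_case_2 = 159438    # lowercase plus punctuation and suffix cleanup

rate_1 = matched_case_1 / total_tracks
rate_2 = matched_case_2 / total_tracks
extra_matches = matched_case_2 - matched_case_1

print(f'case 1: {rate_1:.1%}, case 2: {rate_2:.1%}, gained: {extra_matches}')
```

Under that assumption, the extra cleaning lifts the match rate from roughly 15.5% to roughly 24.2% of tracks, a gain of 57,413 matched tracks.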

In [106]:
merge_case_2.head(10)

Unnamed: 0,artist_name,song_id,song_title,CID,PID,Title,Performer Name
0,robyn,6SluaPiV04KOaRTOIScoff,show me love,260151.0,14428.0,show me love,robyn
3,lukas graham,5kqIPrATaCc2LqxVWzQGbk,7 years,43137.0,53876.0,7 years,lukas graham
10,the killers,3aVyHFxRkf8lSjhWdJ68AW,just another girl,76427.0,39793.0,just another girl,the killers
11,tamia,0zIyxS6QxZogHOpGkI6IZH,deeper,8897.0,14338.0,deeper,tamia
14,kanye west,12D0n7hKpPcjuUpcbAKjjr,dont like.1,68936.0,2118.0,dont like.1,kanye west
15,randy newman,5e0O7MjhNHq9G67qDFM8nR,"monsters, inc.",177077.0,13683.0,"monsters, inc.",randy newman
20,brandi carlile,0E0aHF7AnmeMNIAb7GJSQa,most of all,208997.0,28037.0,most of all,brandi carlile
22,desree,6ygrn7Of3p3mkP483pO1GT,crazy maze,179963.0,99695.0,crazy maze,desree
26,beanie sigel,2037Ob3nwf6lnMCByEoTSB,the truth,82770.0,60504.0,the truth,beanie sigel
32,the 1975,44Ljlpy44mHvLJxcYUvTK0,chocolate,119114.0,8414.0,chocolate,the 1975


In [107]:
merge_case_2.to_csv('../matched_songs_20190209.csv')