## Data Integration

This notebook includes steps for integrating of two tables 'tracks_sample.csv' and 'songs_sample.csv' based on the matching pairs. These two table has two different schemas. Thus, schema of the final table E is the union of these two table's schema.

### Loading libraries and reading data

In [3]:
import pandas as pd
import os
import re

songs = pd.read_csv('dataset/songs_sample.csv')
tracks = pd.read_csv('dataset/tracks_sample.csv')
matchIDPairs = pd.read_csv('dataset/matches.csv')

# filtering the matched tuples from both dataset 
matchedTracks = tracks[tracks['id'].isin(list(matchIDPairs['ltable_id']))]
matchedSongs = songs[songs['id'].isin(list(matchIDPairs['rtable_id']))]

assert(len(matchedTracks)==len(matchedSongs))

AssertionError: 

In [2]:
matchedTracks.head()

Unnamed: 0,id,movie_title,year,episode,song_title,artists
5,262158,the porter wagoner show,1961.0,the osborne brothers (#1.517),the carroll county accident,porter wagoner
70,393455,claudia leitte: ao vivo em copacabana,2008.0,,pensando em você,henrique cerqueira+claudia leitte
91,459080,greta,2009.0,,i wanna die,jolie holland
217,426815,el crimen del padre amaro,2002.0,,te odio,rudy pérez+joel numa+pablo montero
276,328752,22nd annual trumpet awards,2014.0,,i need you now,smokie norful


In [3]:
matchedSongs.head()

Unnamed: 0,id,song_title,artists,year
147,509218,he can only hold her,amy winehouse,2006
154,218585,last train home,pat metheny group,1987
179,261294,soverato,minus 8,2004
351,958721,sweet talkin' woman,electric light orchestra,1977
476,679231,god don't never change,blind willie johnson,1989


In [4]:
matchIDPairs.head()

Unnamed: 0.1,Unnamed: 0,id,ltable_id,rtable_id
0,309,906585,253443,260085
1,196,591561,723561,68786
2,246,740185,338596,635283
3,261,788096,713603,150365
4,37,114823,246156,315410


### Merging two tables 

In [57]:
import math

#Schema of the merged table
E = pd.DataFrame(columns = ['movie_title','year','episode','song_title','artists'])

for index, row in matchIDPairs.iterrows(): 
    left_entry = matchedTracks[matchedTracks['id']==row['ltable_id']]
    right_entry = matchedSongs[matchedSongs['id']==row['rtable_id']]
    
    assert(len(left_entry)==1)
    assert(len(right_entry)==1)
    
    track_id = int(left_entry['id'].item())
    song_id = int(right_entry['id'].item())
    
    if(math.isnan(left_entry['year'].item())):
        left = 0
    else:
        left = int(left_entry['year'].item())
    
    if(math.isnan(right_entry['year'].item())):
        right = 0
    else:
        right = int(right_entry['year'].item())
    
    if left >= right and left != 0:
        year = left
    else:
        year = right
    
    #for song title, larger length value is chosen if two value doesn't have exact string match
    left = str(left_entry['song_title'].item())
    right = str(right_entry['song_title'].item())
    
    if len(left) >= len(right):
        song_title = left
    else:
        song_title = right
    
    #for artist, larger length value is chosen if two value doesn't have exact string match
    left = str(left_entry['artists'].item())
    right = str(right_entry['artists'].item())
    
    if len(left) >= len(right):
        artists = left
    else:
        artists = right
    
    #since movie and episode are unique attributes in the left table, keeping the value as it is
    movie_title = str(left_entry['movie_title'].item())
    episode = str(left_entry['episode'].item())
    
    if episode == 'NaN':
        episode = ''
    
    #creating an entry for table E with all values
    entry = pd.Series([track_id, song_id, movie_title, year, episode, song_title, artists], index=['track_id','song_id','movie_title','year','episode','song_title','artists'])
    
    #appending the merged value to table E
    E = E.append(entry, ignore_index=True)

In [58]:
E.head()

Unnamed: 0,movie_title,year,episode,song_title,artists,song_id,track_id
0,the pledge,2001.0,,poor twisted me,james hetfield+lars ulrich+metallica+arrangeme...,511255.0,678831.0
1,william s. burroughs: commissioner of sewers,1991.0,,batman br�t fische,fm einheit,150981.0,724999.0
2,the warriors,2005.0,,love is a fire,genya ravan+johnny vastano+vini poncia,328251.0,690267.0
3,t in the park 2010,2010.0,muse/calvin harris (#1.3),map of the problematique [live from wembley st...,matthew bellamy+muse,227686.0,231063.0
4,dolly parton: live & well,2004.0,,dagger through the heart,dolly parton,531984.0,418267.0


In [59]:
#Writing the table E to file
E.to_csv('merged_data.csv',sep=',',index=False)