# Data imputation

We first reduce each song's date to its release year. Then, for the songs without a date, we impute the year using three methods:
1. Album: if any song on the album has a date, we assign that date to the album's undated songs (if the known dates differ, we take the latest one).
2. Artist: when an artist has songs with dates, we take the average of those years and assign it to that artist's undated songs.
3. Playlist (pid): each song belongs to several playlists, and each playlist contains several songs that have dates. Suppose song 0 belongs to playlists 1, 2, 3, 4 and 5. We compute the average year of each of those playlists, which gives song 0 five (likely different) estimates, and we impute the average of those five values as the song's year.
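
The playlist-based method (step 3) can be sketched on a toy table as below. This is only an illustration under assumptions: the column names `Pid`, `Track` and `Date` mirror the dataset, but the values and the `Date_imputed` column are made up for the example.

```python
import pandas as pd

# Toy data: one row per (playlist, track) pair; values are invented.
toy = pd.DataFrame({
    "Pid":   [1, 1, 1, 2, 2, 2],
    "Track": ["a", "b", "x", "c", "d", "x"],
    "Date":  [2000.0, 2010.0, float("nan"), 1990.0, 2000.0, float("nan")],
})

# Step 1: mean year of the dated songs in each playlist (NaN is skipped).
playlist_mean = toy.groupby("Pid")["Date"].transform("mean")

# Step 2: for each track, average the playlist means over every playlist
# the track appears in.
track_estimate = playlist_mean.groupby(toy["Track"]).transform("mean")

# Step 3: fill only the missing dates with that estimate.
toy["Date_imputed"] = toy["Date"].fillna(track_estimate)
print(toy)
```

Here track `x` appears in both playlists, so its estimate is the average of the two playlist means, while tracks that already have a date keep it.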

To verify that the imputation gives good results, we take the dates that were already present in the original dataset (the 'original dates'), remove them from the dataset with the newly imputed values, and re-impute them with the same methods. We then compare the re-imputed dates with the 'original dates'. The goal is an error of less than 10 years, i.e. each imputed date should be less than 10 years away from the original date.
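
One way to run such a check is a hold-out comparison, sketched below. It is a simplified variant, not the notebook's exact procedure: it hides a random subset of known dates and re-imputes them with the artist mean only. The helper `artist_mean_impute`, the toy values and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def artist_mean_impute(df):
    """Fill missing 'Date' values with the mean year of the same artist."""
    artist_mean = df.groupby("Artist")["Date"].transform("mean")
    return df["Date"].fillna(artist_mean)

toy = pd.DataFrame({
    "Artist": ["A", "A", "A", "B", "B", "B"],
    "Date":   [2000.0, 2004.0, 2008.0, 1980.0, 1986.0, 1983.0],
})

# Hide a reproducible random subset of the known dates.
rng = np.random.default_rng(0)
known = toy.index[toy["Date"].notna()]
held_out = rng.choice(known, size=2, replace=False)

masked = toy.copy()
masked.loc[held_out, "Date"] = np.nan

# Re-impute the hidden dates and compare them with the originals.
imputed = artist_mean_impute(masked)
abs_error = (imputed.loc[held_out] - toy.loc[held_out, "Date"]).abs()
print("mean absolute error (years):", abs_error.mean())
print("share within 10 years:", (abs_error < 10).mean())
```

Reporting both the mean absolute error and the share of songs under the 10-year threshold makes it easier to see whether a few large misses hide behind a good average.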

From first checks on a couple of songs, we find that our imputed dates are about 5 years away from the real dates of the songs, which is within the 10-year target. Hence, we consider the imputation accurate enough for our purposes.

This way, we can use the date as another predictor in our recommendation models.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("playlists_feat.csv")
# df

## Checking the number of NA's

In [3]:
# number and fraction of NA's
print(df['Date'].isna().sum())
df['Date'].isna().mean()

569067


0.7032672743630558

## Keeping only the year

In [4]:
# keep only the year of each date
df1 = df.copy()
mask = df1['Date'].notna()
# dates that are not already a 4-character year (e.g. '2/25/80')
mask_condition = df1.loc[mask, 'Date'].str.len() != 4
full_dates = mask & mask_condition
df1.loc[full_dates, 'Date'] = pd.to_datetime(df1.loc[full_dates, 'Date']).dt.year


In [5]:
# check
df[df['Track']=='Against The Wind']['Date']

3382      2/25/80
41731     2/25/80
48117     2/25/80
49444     2/25/80
218937    2/25/80
268035    2/25/80
428593    2/25/80
546373    2/25/80
576132    2/25/80
636324    2/25/80
779129    2/25/80
Name: Date, dtype: object

In [6]:
df1[df1['Track']=='Against The Wind']['Date']

3382      1980
41731     1980
48117     1980
49444     1980
218937    1980
268035    1980
428593    1980
546373    1980
576132    1980
636324    1980
779129    1980
Name: Date, dtype: object
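
A caveat on the year extraction: with two-digit years such as `'2/25/80'`, the parser has to guess the century, so an older year may land in the wrong one. The sketch below shows one possible guard. The helper `extract_year`, the explicit `%m/%d/%y` format and the `max_year` cutoff are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def extract_year(dates, max_year=2023):
    """Return the year of 'month/day/two-digit-year' strings.

    max_year is a hypothetical cutoff: any parsed year beyond it is
    assumed to belong to the previous century.
    """
    years = pd.to_datetime(dates, format="%m/%d/%y").dt.year
    # Keep plausible years, shift implausible (future) ones back 100 years.
    return years.where(years <= max_year, years - 100)

sample = pd.Series(["2/25/80", "6/1/65", "3/14/14"])
years = extract_year(sample)
print(years.tolist())
```

The cutoff should be set to the most recent release year that can occur in the data.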

# Imputing the date for the missing values

In [7]:
df1.sort_values(by='Album')

Unnamed: 0,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
220239,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,car jams,104,53,35,1,False,106503
299169,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,Tops,58,51,47,2,False,111315
57436,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,2010,49,39,25,1,False,11611
639851,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,good,24,10,7,1,False,132373
451916,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,THROW BACK,15,14,13,2,False,120848
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227173,spotify:track:0tgVpDi06FyKpA1z0VMD4v,Perfect,Ed Sheeran,÷,,263400,Pop,0.599,0.448,8.0,...,0.168,190.100,3.0,Serenity,246,75,61,1,False,106970
582403,spotify:track:0tgVpDi06FyKpA1z0VMD4v,Perfect,Ed Sheeran,÷,,263400,Emo,0.599,0.448,8.0,...,0.168,190.100,3.0,chill out,28,21,13,1,False,128813
107907,spotify:track:44AH8XwgrebHvQAZJbSrsV,P.T.S.D,Scarlxrd,スカー藩主,,307905,Dark Trap,0.473,0.772,10.0,...,0.303,130.320,4.0,MAD TING,27,14,11,1,False,14661
107930,spotify:track:44AH8XwgrebHvQAZJbSrsV,P.T.S.D,Scarlxrd,スカー藩主,,307905,Underground Rap,0.473,0.772,10.0,...,0.303,130.320,4.0,MAD TING,27,14,11,1,False,14661


### Converting the dates into integers

In [8]:
# Select non-NaN values in the 'Date' column
non_nan_dates = df1['Date'].notna()

# Convert the non-NaN values to integers (no rounding is needed after astype(int))
df1.loc[non_nan_dates, 'Date'] = df1.loc[non_nan_dates, 'Date'].astype(int)
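
An alternative to converting only the non-NaN entries is pandas' nullable integer dtype, which stores whole years and missing values in the same column. The sketch below runs on a made-up toy column, and it is an optional variant rather than what the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({"Date": ["1980", 2014, float("nan"), "2004"]})

# pd.to_numeric parses strings and numbers alike and keeps NaN as NaN.
# The nullable "Int64" dtype then holds integer years alongside <NA>.
toy["Date"] = pd.to_numeric(toy["Date"]).astype("Int64")
print(toy["Date"])
```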

## From Album

In [9]:
# Group by 'Album'  and 'Artist'
grouped_df = df1.groupby(['Album', 'Artist'])

# Iterate through groups and print each group with 'Date'
for name, group in grouped_df:
    print(f"Album: {name}")
    print(group[['Track', 'Artist', 'Album', 'Date']])
    print('\n')

Album: ('#1 Girl', 'Mindless Behavior')
             Track             Artist    Album Date
57436   Mrs. Right  Mindless Behavior  #1 Girl  NaN
159511  Mrs. Right  Mindless Behavior  #1 Girl  NaN
164703  Mrs. Right  Mindless Behavior  #1 Girl  NaN
220239  Mrs. Right  Mindless Behavior  #1 Girl  NaN
237347  Mrs. Right  Mindless Behavior  #1 Girl  NaN
299169  Mrs. Right  Mindless Behavior  #1 Girl  NaN
451916  Mrs. Right  Mindless Behavior  #1 Girl  NaN
556051  Mrs. Right  Mindless Behavior  #1 Girl  NaN
582367  Mrs. Right  Mindless Behavior  #1 Girl  NaN
633966  Mrs. Right  Mindless Behavior  #1 Girl  NaN
639851  Mrs. Right  Mindless Behavior  #1 Girl  NaN
686297  Mrs. Right  Mindless Behavior  #1 Girl  NaN
719168  Mrs. Right  Mindless Behavior  #1 Girl  NaN
799374  Mrs. Right  Mindless Behavior  #1 Girl  NaN


Album: ('#CoolUrbanNewTalent', 'Etta Bond')
             Track     Artist                Album Date
681080  Feels Like  Etta Bond  #CoolUrbanNewTalent  NaN
725133  Feels Like  Et

In [10]:
# Filter albums with at least one non-null date
albums_with_dates = df1.dropna(subset=['Date']).groupby(['Album', 'Artist']).size().index

# Group by 'Album' and 'Artist' for filtered albums
filtered_grouped_df = df1[df1.set_index(['Album', 'Artist']).index.isin(albums_with_dates)].groupby(['Album', 'Artist'])

# Iterate through filtered groups and print each group with 'Date'
for (album, artist), group in filtered_grouped_df:
    print(f"Album: {album}, Artist: {artist}")
    print(group[['Track', 'Artist', 'Album', 'Date']])
    print('\n')

Album: #SELFIE, Artist: The Chainsmokers
          Track            Artist    Album  Date
11816   #SELFIE  The Chainsmokers  #SELFIE  2014
16313   #SELFIE  The Chainsmokers  #SELFIE  2014
30969   #SELFIE  The Chainsmokers  #SELFIE  2014
33911   #SELFIE  The Chainsmokers  #SELFIE  2014
40439   #SELFIE  The Chainsmokers  #SELFIE  2014
...         ...               ...      ...   ...
796177  #SELFIE  The Chainsmokers  #SELFIE  2014
797552  #SELFIE  The Chainsmokers  #SELFIE  2014
801758  #SELFIE  The Chainsmokers  #SELFIE  2014
804629  #SELFIE  The Chainsmokers  #SELFIE  2014
805047  #SELFIE  The Chainsmokers  #SELFIE  2014

[156 rows x 4 columns]


Album: '74 Jailbreak, Artist: AC/DC
            Track Artist          Album  Date
94716   Jailbreak  AC/DC  '74 Jailbreak  1984
279254  Jailbreak  AC/DC  '74 Jailbreak  1984
572582  Jailbreak  AC/DC  '74 Jailbreak  1984
580193  Jailbreak  AC/DC  '74 Jailbreak  1984
784350  Jailbreak  AC/DC  '74 Jailbreak  1984


Album: '90s Soul Number 1's, Ar
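
Rather than printing every group, the same information can be summarised by counting the distinct known years per album. The sketch below uses invented toy data, and the names `conflicting` and `undated` are introduced only for this example. Albums with more than one distinct year are the cases where the choice of which date to propagate matters.

```python
import pandas as pd

toy = pd.DataFrame({
    "Album":  ["X", "X", "Y", "Y", "Z"],
    "Artist": ["A", "A", "B", "B", "C"],
    "Date":   [2014.0, float("nan"), 1984.0, 1986.0, float("nan")],
})

# nunique ignores NaN, so it counts the distinct known years per album.
years_per_album = toy.groupby(["Album", "Artist"])["Date"].nunique()

conflicting = years_per_album[years_per_album > 1]   # more than one known year
undated = years_per_album[years_per_album == 0]      # no known year at all
print(conflicting)
print(undated)
```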

In [11]:
# Define a function to impute missing dates within each group
def impute_dates(group):
    group['Date'] = group['Date'].fillna(group['Date'].max())
    return group

# Apply the impute_dates function to each group
filtered = grouped_df.apply(impute_dates)
filtered

Album,Artist,,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
#1 Girl,Mindless Behavior,57436,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,2010,49,39,25,1,False,11611
#1 Girl,Mindless Behavior,159511,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,my stuff,130,96,72,3,False,102874
#1 Girl,Mindless Behavior,164703,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,Throwback,84,74,57,1,False,103196
#1 Girl,Mindless Behavior,220239,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,car jams,104,53,35,1,False,106503
#1 Girl,Mindless Behavior,237347,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,tunes,237,127,75,1,False,107561
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
÷,Ed Sheeran,808557,spotify:track:7qiZfU4dY1lWllzX7mPBI3,Shape of You,Ed Sheeran,÷,,233713,Pop,0.825,0.652,1.0,...,0.931,191.954,4.0,❤️,84,59,43,1,False,142973
÷,Ed Sheeran,808851,spotify:track:7qiZfU4dY1lWllzX7mPBI3,Shape of You,Ed Sheeran,÷,,233713,Pop,0.825,0.652,1.0,...,0.931,191.954,4.0,LET'S GO,111,59,38,1,False,142990
スカー藩主,Scarlxrd,107907,spotify:track:44AH8XwgrebHvQAZJbSrsV,P.T.S.D,Scarlxrd,スカー藩主,,307905,Dark Trap,0.473,0.772,10.0,...,0.303,130.320,4.0,MAD TING,27,14,11,1,False,14661
スカー藩主,Scarlxrd,107930,spotify:track:44AH8XwgrebHvQAZJbSrsV,P.T.S.D,Scarlxrd,スカー藩主,,307905,Underground Rap,0.473,0.772,10.0,...,0.303,130.320,4.0,MAD TING,27,14,11,1,False,14661
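
The album imputation can also be written with `groupby(...).transform`, which keeps the original row index instead of adding group keys to it. The sketch below uses invented toy data and is an alternative formulation, not the notebook's code.

```python
import pandas as pd

toy = pd.DataFrame({
    "Album":  ["X", "X", "X", "Y", "Y"],
    "Artist": ["A", "A", "A", "B", "B"],
    "Track":  ["t1", "t2", "t3", "t4", "t5"],
    "Date":   [2014.0, float("nan"), float("nan"), float("nan"), float("nan")],
})

# transform("max") returns one value per row, aligned with toy's index,
# so the frame keeps its original shape and row order.
album_year = toy.groupby(["Album", "Artist"])["Date"].transform("max")
toy["Date"] = toy["Date"].fillna(album_year)
print(toy)
```

Album `Y` has no dated song, so its rows stay missing and are left for the artist and playlist steps.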


#### Checking the data before and after imputation

In [12]:
# Filter groups (albums) with at least one non-null date
valid_albums = grouped_df.filter(lambda x: x['Date'].notnull().any())

# Display the DataFrame before imputation
print("Before Imputation:")
print(valid_albums)

Before Imputation:
                                   Track URI             Track  \
0       spotify:track:6sqNctd7MlJoKDOxPVCAvU   My Happy Ending   
2       spotify:track:34ceTg8ChN5HjrqiIYCn9Q  Miss Independent   
4       spotify:track:7H6ev70Weq6DdpZyyTmUXk       Say My Name   
5       spotify:track:7H6ev70Weq6DdpZyyTmUXk       Say My Name   
7       spotify:track:4pmc2AxSEq6g7hPVlJCPyP  Jumpin', Jumpin'   
...                                      ...               ...   
809165  spotify:track:5u0YB9bpmgEPS2bPhwfRFV              arms   
809166  spotify:track:3gbBpTdY8lnQwqxNCcf795           Pompeii   
809169  spotify:track:1u0l8zWpQeMYStFkc2mLD7        Everywhere   
809171  spotify:track:64GRDrL1efgXclrhVCeuA0       Lay Me Down   
809173  spotify:track:2iUmqdfGZcHIhS3b9E9EWq   Everybody Talks   

                 Artist                      Album  Date  Duration  \
0         Avril Lavigne              Under My Skin  2004    242413   
2                 Ne-Yo      Year Of The Gentlem

In [13]:
# Define a function to impute missing dates within each group
def impute_dates(group):
    group['Date'] = group['Date'].fillna(group['Date'].max())
    return group

# Apply the impute_dates function to each group
filtered_grouped_df = valid_albums.groupby(['Album', 'Artist']).apply(impute_dates)

# Display the DataFrame after imputation
print("\nAfter Imputation:")
print(filtered_grouped_df)


After Imputation:
                                                            Track URI  \
Album   Artist                                                          
#SELFIE The Chainsmokers 11816   spotify:track:0zkiQH567SDLqfWNBaU3hv   
                         16313   spotify:track:0zkiQH567SDLqfWNBaU3hv   
                         30969   spotify:track:0zkiQH567SDLqfWNBaU3hv   
                         33911   spotify:track:0zkiQH567SDLqfWNBaU3hv   
                         40439   spotify:track:0zkiQH567SDLqfWNBaU3hv   
...                                                               ...   
xx      The xx           804446  spotify:track:0bXpmJyHHYPk6QBFj25bYF   
                         804919  spotify:track:0bXpmJyHHYPk6QBFj25bYF   
                         807013  spotify:track:0bXpmJyHHYPk6QBFj25bYF   
                         807156  spotify:track:0bXpmJyHHYPk6QBFj25bYF   
                         808215  spotify:track:0bXpmJyHHYPk6QBFj25bYF   

                               

In [17]:
df2 = filtered.copy()
# number and fraction of NA's
print(df2['Date'].isna().sum())
df2['Date'].isna().mean()

439177


0.5427459539086675

## From Artist

In [21]:
df2.reset_index(drop=True, inplace=True)
df2

Unnamed: 0,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
0,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,2010,49,39,25,1,False,11611
1,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,my stuff,130,96,72,3,False,102874
2,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,Throwback,84,74,57,1,False,103196
3,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,car jams,104,53,35,1,False,106503
4,spotify:track:4N4CHJqFZHyB7SBUuSFu1y,Mrs. Right,Mindless Behavior,#1 Girl,,248187,RnB,0.439,0.753,0.0,...,0.538,190.678,5.0,tunes,237,127,75,1,False,107561
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
809171,spotify:track:7qiZfU4dY1lWllzX7mPBI3,Shape of You,Ed Sheeran,÷,,233713,Pop,0.825,0.652,1.0,...,0.931,191.954,4.0,❤️,84,59,43,1,False,142973
809172,spotify:track:7qiZfU4dY1lWllzX7mPBI3,Shape of You,Ed Sheeran,÷,,233713,Pop,0.825,0.652,1.0,...,0.931,191.954,4.0,LET'S GO,111,59,38,1,False,142990
809173,spotify:track:44AH8XwgrebHvQAZJbSrsV,P.T.S.D,Scarlxrd,スカー藩主,,307905,Dark Trap,0.473,0.772,10.0,...,0.303,130.320,4.0,MAD TING,27,14,11,1,False,14661
809174,spotify:track:44AH8XwgrebHvQAZJbSrsV,P.T.S.D,Scarlxrd,スカー藩主,,307905,Underground Rap,0.473,0.772,10.0,...,0.303,130.320,4.0,MAD TING,27,14,11,1,False,14661


In [23]:
# Group by 'Artist' only
artist_group = df2.groupby('Artist')

# Filter groups (artists) with at least one non-null date
valid_artists = artist_group.filter(lambda x: x['Date'].notnull().any())

# Display the DataFrame before imputation
valid_artists.sort_values(by='Artist')

Unnamed: 0,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
487810,spotify:track:4r8lRYnoOGdEi6YyI5OC1o,Bye Bye Bye,*NSYNC,No Strings Attached,2000.0,200560,"boy band,dance pop,pop",0.614,0.928,8.0,...,0.879,172.656,4.0,00's,44,44,41,1,False,100529
488076,spotify:track:4r8lRYnoOGdEi6YyI5OC1o,Bye Bye Bye,*NSYNC,No Strings Attached,2000.0,200560,"boy band,dance pop,pop",0.614,0.928,8.0,...,0.879,172.656,4.0,Old Stuff,120,115,104,3,False,116786
488077,spotify:track:4r8lRYnoOGdEi6YyI5OC1o,Bye Bye Bye,*NSYNC,No Strings Attached,2000.0,200560,"boy band,dance pop,pop",0.614,0.928,8.0,...,0.879,172.656,4.0,Stronger,44,40,31,1,False,117118
488078,spotify:track:4r8lRYnoOGdEi6YyI5OC1o,Bye Bye Bye,*NSYNC,No Strings Attached,2000.0,200560,"boy band,dance pop,pop",0.614,0.928,8.0,...,0.879,172.656,4.0,Housewarming,134,126,117,3,False,117245
488079,spotify:track:35zGjsxI020C2NPKp2fzS7,It's Gonna Be Me,*NSYNC,No Strings Attached,2000.0,191040,"boy band,dance pop,pop",0.644,0.874,0.0,...,0.882,165.090,4.0,Throwback,218,166,108,3,False,117404
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
305,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,Wedding,208,181,127,1,False,113946
306,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,Party Hits,231,172,126,1,False,114208
307,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,My Music <3,106,85,62,1,False,114342
297,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,Hip hop,116,101,78,2,False,112695


In [24]:
valid_artists[(valid_artists['Artist'] == 'Beyoncé') & (valid_artists['Date'].isna())]

Unnamed: 0,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
98434,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,oldies,94,84,74,3,False,129
98435,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,go to,99,92,68,1,False,131
98436,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,motivational,41,35,33,1,False,172
98437,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,🤤🤤,17,15,6,1,False,180
98438,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,TBT,31,30,23,1,False,250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98936,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,childhood jams,112,83,62,2,False,142608
98937,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,old skool,123,105,77,3,False,142774
98938,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,Beyonce,59,8,3,1,False,142923
98939,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,lit,133,105,85,2,False,142956


In [25]:
valid_artists[(valid_artists['Artist'] == 'will.i.am')]

Unnamed: 0,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
188,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,party music,176,156,118,2,False,124
189,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,throw backs,70,62,45,1,False,355
190,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,Party time,29,28,24,1,False,388
191,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,Sweat.It.Out.,22,22,21,1,True,413
192,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,Pop,104,97,77,1,False,729
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
583771,spotify:track:1z7FW0nlEBGtQWQ19kz7qp,I Got It From My Mama,will.i.am,Songs About Girls,2007.0,241520,"dance pop,pop",0.888,0.777,6.0,...,0.876,118.998,4.0,turn up,173,157,130,4,False,120533
583772,spotify:track:1z7FW0nlEBGtQWQ19kz7qp,I Got It From My Mama,will.i.am,Songs About Girls,2007.0,241520,"dance pop,pop",0.888,0.777,6.0,...,0.876,118.998,4.0,Anita,37,36,34,1,False,123490
583773,spotify:track:1z7FW0nlEBGtQWQ19kz7qp,I Got It From My Mama,will.i.am,Songs About Girls,2007.0,241520,"dance pop,pop",0.888,0.777,6.0,...,0.876,118.998,4.0,Gaby,181,137,70,3,False,126013
583774,spotify:track:1z7FW0nlEBGtQWQ19kz7qp,I Got It From My Mama,will.i.am,Songs About Girls,2007.0,241520,"dance pop,pop",0.888,0.777,6.0,...,0.876,118.998,4.0,APRIL 2017,28,28,25,2,False,130235


In [26]:
valid_artists[(valid_artists['Artist'] == 'Ed Sheeran') & (valid_artists['Date'].isna())]

Unnamed: 0,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
174102,spotify:track:5X6TnKT37TaSDkFm0598Uo,Castle on the Hill - Acoustic,Ed Sheeran,Castle on the Hill,,226227,Emo,0.563,0.260,2.0,...,0.616,145.561,4.0,Acoustic songs,32,32,29,1,False,10104
174103,spotify:track:5X6TnKT37TaSDkFm0598Uo,Castle on the Hill - Acoustic,Ed Sheeran,Castle on the Hill,,226227,Emo,0.563,0.260,2.0,...,0.616,145.561,4.0,Acoustic Pop,134,130,106,2,False,11323
174104,spotify:track:5X6TnKT37TaSDkFm0598Uo,Castle on the Hill - Acoustic,Ed Sheeran,Castle on the Hill,,226227,Emo,0.563,0.260,2.0,...,0.616,145.561,4.0,chill vibes,36,32,27,1,False,13239
174105,spotify:track:5X6TnKT37TaSDkFm0598Uo,Castle on the Hill - Acoustic,Ed Sheeran,Castle on the Hill,,226227,Emo,0.563,0.260,2.0,...,0.616,145.561,4.0,NEW,42,42,37,1,False,13327
174106,spotify:track:5X6TnKT37TaSDkFm0598Uo,Castle on the Hill - Acoustic,Ed Sheeran,Castle on the Hill,,226227,Emo,0.563,0.260,2.0,...,0.616,145.561,4.0,Happy,200,141,95,1,False,13769
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
809168,spotify:track:0tgVpDi06FyKpA1z0VMD4v,Perfect,Ed Sheeran,÷,,263400,Pop,0.599,0.448,8.0,...,0.168,190.100,3.0,lit,133,105,85,2,False,142956
809169,spotify:track:0afhq8XCExXpqazXczTSve,Galway Girl,Ed Sheeran,÷,,170827,Pop,0.624,0.876,9.0,...,0.781,199.886,4.0,Sydney,118,102,86,1,False,142964
809170,spotify:track:7qiZfU4dY1lWllzX7mPBI3,Shape of You,Ed Sheeran,÷,,233713,Pop,0.825,0.652,1.0,...,0.931,191.954,4.0,❤️,84,59,43,1,False,142973
809171,spotify:track:7qiZfU4dY1lWllzX7mPBI3,Shape of You,Ed Sheeran,÷,,233713,Pop,0.825,0.652,1.0,...,0.931,191.954,4.0,❤️,84,59,43,1,False,142973


In [27]:
valid_artists[(valid_artists['Artist'] == '112')]

Unnamed: 0,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
8672,spotify:track:3kVIFDE3G89I2RPVkiRaRj,Cupid,112,112,,252267,RnB,0.685,0.380,8.0,...,0.870,175.562,4.0,Love Music,188,157,103,2,False,156
8673,spotify:track:3kVIFDE3G89I2RPVkiRaRj,Cupid,112,112,,252267,RnB,0.685,0.380,8.0,...,0.870,175.562,4.0,Seduction,120,97,63,1,False,930
8674,spotify:track:3kVIFDE3G89I2RPVkiRaRj,Cupid,112,112,,252267,RnB,0.685,0.380,8.0,...,0.870,175.562,4.0,David,33,31,26,1,True,989
8675,spotify:track:3kVIFDE3G89I2RPVkiRaRj,Cupid,112,112,,252267,RnB,0.685,0.380,8.0,...,0.870,175.562,4.0,R&B classics,110,89,69,1,False,1104
8676,spotify:track:3kVIFDE3G89I2RPVkiRaRj,Cupid,112,112,,252267,RnB,0.685,0.380,8.0,...,0.870,175.562,4.0,R&B classics,110,89,69,1,False,1104
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
538215,spotify:track:6iajHa34cSiD5s42Cq9miJ,Peaches & Cream (feat. P. Diddy),112,R&B Hits,,225947,Hiphop,0.706,0.537,7.0,...,0.775,203.746,4.0,workout,87,80,72,1,False,139415
538216,spotify:track:6iajHa34cSiD5s42Cq9miJ,Peaches & Cream (feat. P. Diddy),112,R&B Hits,,225947,Hiphop,0.706,0.537,7.0,...,0.775,203.746,4.0,ee,97,91,79,1,False,139887
538217,spotify:track:6iajHa34cSiD5s42Cq9miJ,Peaches & Cream (feat. P. Diddy),112,R&B Hits,,225947,Hiphop,0.706,0.537,7.0,...,0.775,203.746,4.0,R&B Jams,171,139,103,1,False,141313
538218,spotify:track:6iajHa34cSiD5s42Cq9miJ,Peaches & Cream (feat. P. Diddy),112,R&B Hits,,225947,Hiphop,0.706,0.537,7.0,...,0.775,203.746,4.0,san Diego,77,64,51,4,False,141406


In [28]:
# Define a function to impute missing dates within each group
def impute_dates(group):
    group['Date'] = group['Date'].fillna(group['Date'].mean())
    return group

# Apply the impute_dates function to each group
filtered2 = artist_group.apply(impute_dates)
filtered2

Artist,,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
$hun,699846,spotify:track:3NRgC9yCkATmwnhMo1dZI7,Nikes on My Feet (Extended),$hun,The Rough Draft,,328411,Underground Rap,0.808,0.488,9.0,...,0.4820,184.142,4.0,Wavy ~,130,78,53,3,False,105673
$uicideBoy$,482,spotify:track:6lrOr4Ks7b6B03n0YmKjND,Cold Turkey,$uicideBoy$,$outh $ide $uicide,,182334,Trap Metal,0.891,0.670,1.0,...,0.1800,212.032,4.0,Adrian,118,81,61,1,False,105780
$uicideBoy$,483,spotify:track:6lrOr4Ks7b6B03n0YmKjND,Cold Turkey,$uicideBoy$,$outh $ide $uicide,,182334,Underground Rap,0.891,0.670,1.0,...,0.1800,106.016,4.0,Adrian,118,81,61,1,False,105780
$uicideBoy$,484,spotify:track:6lrOr4Ks7b6B03n0YmKjND,Cold Turkey,$uicideBoy$,$outh $ide $uicide,,182334,Dark Trap,0.891,0.670,1.0,...,0.1800,212.032,4.0,Adrian,118,81,61,1,False,105780
$uicideBoy$,485,spotify:track:6lrOr4Ks7b6B03n0YmKjND,Cold Turkey,$uicideBoy$,$outh $ide $uicide,,182334,Trap Metal,0.891,0.670,1.0,...,0.1800,212.032,4.0,new new,131,90,59,1,False,117374
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
xxyyxx,782409,spotify:track:6TMSo0TqIKaotFuWeUKxc6,Dmt,xxyyxx,Xxyyxx,,260206,Dark Trap,0.537,0.424,1.0,...,0.0761,193.994,4.0,new age,73,69,66,1,False,130210
xxyyxx,782410,spotify:track:6TMSo0TqIKaotFuWeUKxc6,Dmt,xxyyxx,Xxyyxx,,260206,Dark Trap,0.537,0.424,1.0,...,0.0761,193.994,4.0,High,62,55,48,1,False,131497
xxyyxx,782411,spotify:track:6TMSo0TqIKaotFuWeUKxc6,Dmt,xxyyxx,Xxyyxx,,260206,Dark Trap,0.537,0.424,1.0,...,0.0761,193.994,4.0,Psychedelic,39,23,16,1,False,135446
xxyyxx,782412,spotify:track:6TMSo0TqIKaotFuWeUKxc6,Dmt,xxyyxx,Xxyyxx,,260206,Dark Trap,0.537,0.424,1.0,...,0.0761,193.994,4.0,Feels,104,86,68,3,False,140796
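
Averaging over an artist's songs produces fractional years. If whole years are preferred, the artist mean can be rounded, and a flag column can record which rows had no date to begin with. The sketch below uses invented toy data, and the `Date_was_missing` column is a hypothetical addition.

```python
import pandas as pd

toy = pd.DataFrame({
    "Artist": ["A", "A", "A", "A", "B"],
    "Date":   [2006.0, 2010.0, 2013.0, float("nan"), float("nan")],
})

# Remember which rows had no date, so imputed years can be told apart later.
toy["Date_was_missing"] = toy["Date"].isna()

# Mean year per artist, rounded to a whole year, aligned with toy's index.
artist_mean = toy.groupby("Artist")["Date"].transform("mean").round()
toy["Date"] = toy["Date"].fillna(artist_mean)
print(toy)
```

Artist `B` has no dated song, so that row stays missing and is left for the playlist-based step.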


In [29]:
filtered2[filtered2['Track']=='Irreplaceable']

Artist,,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
Beyoncé,98434,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,2009.832061,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,oldies,94,84,74,3,False,129
Beyoncé,98435,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,2009.832061,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,go to,99,92,68,1,False,131
Beyoncé,98436,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,2009.832061,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,motivational,41,35,33,1,False,172
Beyoncé,98437,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,2009.832061,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,🤤🤤,17,15,6,1,False,180
Beyoncé,98438,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,2009.832061,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,TBT,31,30,23,1,False,250
Beyoncé,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Beyoncé,98936,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,2009.832061,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,childhood jams,112,83,62,2,False,142608
Beyoncé,98937,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,2009.832061,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,old skool,123,105,77,3,False,142774
Beyoncé,98938,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,2009.832061,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,Beyonce,59,8,3,1,False,142923
Beyoncé,98939,spotify:track:07lxDm1s8FVO4GF54Nooiz,Irreplaceable,Beyoncé,B'Day,2009.832061,227667,RnB,0.576,0.697,10.0,...,0.501,175.906,4.0,lit,133,105,85,2,False,142956


In [30]:
filtered2[(filtered2['Artist'] == 'will.i.am')]

Artist,,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
will.i.am,188,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,2007.0,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,party music,176,156,118,2,False,124
will.i.am,189,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,2007.0,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,throw backs,70,62,45,1,False,355
will.i.am,190,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,2007.0,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,Party time,29,28,24,1,False,388
will.i.am,191,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,2007.0,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,Sweat.It.Out.,22,22,21,1,True,413
will.i.am,192,spotify:track:4DMPn1rEujpIJIvjy9HKV8,Scream & Shout,will.i.am,#willpower,2007.0,283400,RnB,0.770,0.684,0.0,...,0.479,130.030,4.0,Pop,104,97,77,1,False,729
will.i.am,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
will.i.am,583771,spotify:track:1z7FW0nlEBGtQWQ19kz7qp,I Got It From My Mama,will.i.am,Songs About Girls,2007.0,241520,"dance pop,pop",0.888,0.777,6.0,...,0.876,118.998,4.0,turn up,173,157,130,4,False,120533
will.i.am,583772,spotify:track:1z7FW0nlEBGtQWQ19kz7qp,I Got It From My Mama,will.i.am,Songs About Girls,2007.0,241520,"dance pop,pop",0.888,0.777,6.0,...,0.876,118.998,4.0,Anita,37,36,34,1,False,123490
will.i.am,583773,spotify:track:1z7FW0nlEBGtQWQ19kz7qp,I Got It From My Mama,will.i.am,Songs About Girls,2007.0,241520,"dance pop,pop",0.888,0.777,6.0,...,0.876,118.998,4.0,Gaby,181,137,70,3,False,126013
will.i.am,583774,spotify:track:1z7FW0nlEBGtQWQ19kz7qp,I Got It From My Mama,will.i.am,Songs About Girls,2007.0,241520,"dance pop,pop",0.888,0.777,6.0,...,0.876,118.998,4.0,APRIL 2017,28,28,25,2,False,130235


In [31]:
filtered2[(filtered2['Artist'] == 'Ed Sheeran')]

Artist,,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
Ed Sheeran,2103,spotify:track:53Pgsvu3qSYO2aXt5J2vcL,Lego House,Ed Sheeran,+,2011.000000,185093,"pop,singer-songwriter pop,uk pop",0.592,0.637,11.0,...,0.565,159.701,4.0,Mom's playlist,79,69,61,1,False,17
Ed Sheeran,2104,spotify:track:53Pgsvu3qSYO2aXt5J2vcL,Lego House,Ed Sheeran,+,2011.000000,185093,"pop,singer-songwriter pop,uk pop",0.592,0.637,11.0,...,0.565,159.701,4.0,Road Trippin',158,131,106,2,False,42
Ed Sheeran,2105,spotify:track:1VdZ0vKfR5jneCmWIUAMxK,The A Team,Ed Sheeran,+,2011.000000,258373,Emo,0.642,0.289,9.0,...,0.407,169.992,4.0,Chill,66,54,30,1,False,80
Ed Sheeran,2106,spotify:track:1VdZ0vKfR5jneCmWIUAMxK,The A Team,Ed Sheeran,+,2011.000000,258373,"pop,singer-songwriter pop,uk pop",0.642,0.289,9.0,...,0.407,84.996,4.0,Chill,66,54,30,1,False,80
Ed Sheeran,2107,spotify:track:1VdZ0vKfR5jneCmWIUAMxK,The A Team,Ed Sheeran,+,2011.000000,258373,"pop,singer-songwriter pop,uk pop",0.642,0.289,9.0,...,0.407,84.996,4.0,Catchy Songs,98,82,67,1,False,93
Ed Sheeran,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ed Sheeran,809168,spotify:track:0tgVpDi06FyKpA1z0VMD4v,Perfect,Ed Sheeran,÷,2011.034059,263400,Pop,0.599,0.448,8.0,...,0.168,190.100,3.0,lit,133,105,85,2,False,142956
Ed Sheeran,809169,spotify:track:0afhq8XCExXpqazXczTSve,Galway Girl,Ed Sheeran,÷,2011.034059,170827,Pop,0.624,0.876,9.0,...,0.781,199.886,4.0,Sydney,118,102,86,1,False,142964
Ed Sheeran,809170,spotify:track:7qiZfU4dY1lWllzX7mPBI3,Shape of You,Ed Sheeran,÷,2011.034059,233713,Pop,0.825,0.652,1.0,...,0.931,191.954,4.0,❤️,84,59,43,1,False,142973
Ed Sheeran,809171,spotify:track:7qiZfU4dY1lWllzX7mPBI3,Shape of You,Ed Sheeran,÷,2011.034059,233713,Pop,0.825,0.652,1.0,...,0.931,191.954,4.0,❤️,84,59,43,1,False,142973


In [32]:
filtered2[(filtered2['Artist'] == '112')]

Artist (index),(unnamed index),Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
112,8672,spotify:track:3kVIFDE3G89I2RPVkiRaRj,Cupid,112,112,2001.0,252267,RnB,0.685,0.380,8.0,...,0.870,175.562,4.0,Love Music,188,157,103,2,False,156
112,8673,spotify:track:3kVIFDE3G89I2RPVkiRaRj,Cupid,112,112,2001.0,252267,RnB,0.685,0.380,8.0,...,0.870,175.562,4.0,Seduction,120,97,63,1,False,930
112,8674,spotify:track:3kVIFDE3G89I2RPVkiRaRj,Cupid,112,112,2001.0,252267,RnB,0.685,0.380,8.0,...,0.870,175.562,4.0,David,33,31,26,1,True,989
112,8675,spotify:track:3kVIFDE3G89I2RPVkiRaRj,Cupid,112,112,2001.0,252267,RnB,0.685,0.380,8.0,...,0.870,175.562,4.0,R&B classics,110,89,69,1,False,1104
112,8676,spotify:track:3kVIFDE3G89I2RPVkiRaRj,Cupid,112,112,2001.0,252267,RnB,0.685,0.380,8.0,...,0.870,175.562,4.0,R&B classics,110,89,69,1,False,1104
112,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112,538215,spotify:track:6iajHa34cSiD5s42Cq9miJ,Peaches & Cream (feat. P. Diddy),112,R&B Hits,2001.0,225947,Hiphop,0.706,0.537,7.0,...,0.775,203.746,4.0,workout,87,80,72,1,False,139415
112,538216,spotify:track:6iajHa34cSiD5s42Cq9miJ,Peaches & Cream (feat. P. Diddy),112,R&B Hits,2001.0,225947,Hiphop,0.706,0.537,7.0,...,0.775,203.746,4.0,ee,97,91,79,1,False,139887
112,538217,spotify:track:6iajHa34cSiD5s42Cq9miJ,Peaches & Cream (feat. P. Diddy),112,R&B Hits,2001.0,225947,Hiphop,0.706,0.537,7.0,...,0.775,203.746,4.0,R&B Jams,171,139,103,1,False,141313
112,538218,spotify:track:6iajHa34cSiD5s42Cq9miJ,Peaches & Cream (feat. P. Diddy),112,R&B Hits,2001.0,225947,Hiphop,0.706,0.537,7.0,...,0.775,203.746,4.0,san Diego,77,64,51,4,False,141406


Here, Beyoncé's 'Irreplaceable' has an imputed date of 2010, while the actual date is 2006. For 'Scream & Shout', the imputed date is 2007, while the actual date is 2013. Both are within 6 years of the real release year.

For Ed Sheeran, all songs have been imputed to 2011. The actual dates are: 'Shape of You': 2017; 'Photograph': 2014; 'Galway Girl': 2017; 'Castle on the Hill': 2017; 'Perfect': 2017.

The dates of songs from 112 have all been imputed to 2001. The actual dates are 2001, 1996, 2005 and 2005.

All of these differ from the actual date by less than 10 years (at most 6 years in the examples above). On the basis of this spot check, we treat the artist-based imputation as acceptable.
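
The spot check above can be made explicit by tabulating the imputed and actual years quoted in the text and computing the absolute error. This is only a sketch over the named songs (the frame `checks` and its column names are illustrative, not part of the notebook); the 112 songs are left out because the text does not say which year belongs to which track.

```python
import pandas as pd

# Imputed vs. actual release years quoted in the text (spot check only).
checks = pd.DataFrame(
    [
        ("Beyoncé", "Irreplaceable", 2010, 2006),
        ("will.i.am", "Scream & Shout", 2007, 2013),
        ("Ed Sheeran", "Shape of You", 2011, 2017),
        ("Ed Sheeran", "Photograph", 2011, 2014),
        ("Ed Sheeran", "Galway Girl", 2011, 2017),
        ("Ed Sheeran", "Castle on the Hill", 2011, 2017),
        ("Ed Sheeran", "Perfect", 2011, 2017),
    ],
    columns=["Artist", "Track", "Imputed", "Actual"],
)

# Absolute distance in years between the imputed and the actual date.
checks["AbsError"] = (checks["Imputed"] - checks["Actual"]).abs()

max_error = int(checks["AbsError"].max())
mean_error = float(checks["AbsError"].mean())
print(checks)
print(f"max error: {max_error} years, mean error: {mean_error:.2f} years")
```

A handful of popular songs is not a representative sample, so these numbers describe the examples rather than the dataset as a whole.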

In [33]:
df3 = filtered2.copy()
# number and fraction of NA's remaining after the artist-based imputation
print(len(df3[df3['Date'].isna()]))
len(df3[df3['Date'].isna()]) / len(df3)

351934


0.43492886590803487

## From Pid

In [34]:
df3[df3['Date'].isna()]

Artist (index),(unnamed index),Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
$hun,699846,spotify:track:3NRgC9yCkATmwnhMo1dZI7,Nikes on My Feet (Extended),$hun,The Rough Draft,,328411,Underground Rap,0.808,0.488,9.0,...,0.4820,184.142,4.0,Wavy ~,130,78,53,3,False,105673
$uicideBoy$,482,spotify:track:6lrOr4Ks7b6B03n0YmKjND,Cold Turkey,$uicideBoy$,$outh $ide $uicide,,182334,Trap Metal,0.891,0.670,1.0,...,0.1800,212.032,4.0,Adrian,118,81,61,1,False,105780
$uicideBoy$,483,spotify:track:6lrOr4Ks7b6B03n0YmKjND,Cold Turkey,$uicideBoy$,$outh $ide $uicide,,182334,Underground Rap,0.891,0.670,1.0,...,0.1800,106.016,4.0,Adrian,118,81,61,1,False,105780
$uicideBoy$,484,spotify:track:6lrOr4Ks7b6B03n0YmKjND,Cold Turkey,$uicideBoy$,$outh $ide $uicide,,182334,Dark Trap,0.891,0.670,1.0,...,0.1800,212.032,4.0,Adrian,118,81,61,1,False,105780
$uicideBoy$,485,spotify:track:6lrOr4Ks7b6B03n0YmKjND,Cold Turkey,$uicideBoy$,$outh $ide $uicide,,182334,Trap Metal,0.891,0.670,1.0,...,0.1800,212.032,4.0,new new,131,90,59,1,False,117374
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
xxyyxx,782409,spotify:track:6TMSo0TqIKaotFuWeUKxc6,Dmt,xxyyxx,Xxyyxx,,260206,Dark Trap,0.537,0.424,1.0,...,0.0761,193.994,4.0,new age,73,69,66,1,False,130210
xxyyxx,782410,spotify:track:6TMSo0TqIKaotFuWeUKxc6,Dmt,xxyyxx,Xxyyxx,,260206,Dark Trap,0.537,0.424,1.0,...,0.0761,193.994,4.0,High,62,55,48,1,False,131497
xxyyxx,782411,spotify:track:6TMSo0TqIKaotFuWeUKxc6,Dmt,xxyyxx,Xxyyxx,,260206,Dark Trap,0.537,0.424,1.0,...,0.0761,193.994,4.0,Psychedelic,39,23,16,1,False,135446
xxyyxx,782412,spotify:track:6TMSo0TqIKaotFuWeUKxc6,Dmt,xxyyxx,Xxyyxx,,260206,Dark Trap,0.537,0.424,1.0,...,0.0761,193.994,4.0,Feels,104,86,68,3,False,140796


In [36]:
df4 = df3.copy()
# Rows whose date is still missing after the album and artist steps
missing_dates = df3['Date'].isna()
# Mean date of each playlist (Pid), broadcast back to every row of that playlist
playlist_mean_dates = df3.groupby('Pid')['Date'].transform('mean')
# Impute missing values in the 'Date' column with the mean date of the corresponding playlist
df4.loc[missing_dates, 'Date'] = playlist_mean_dates[missing_dates]
# Coerce any non-numeric value to NaN so the column is purely numeric
df4['Date'] = pd.to_numeric(df4['Date'], errors='coerce')
# Playlists without any dated song still give NaN: fall back to the global mean
# (a different imputation strategy could be used here)
df4['Date'] = df4['Date'].fillna(df4['Date'].mean())
# Convert the 'Date' column to int (astype(int) truncates the fractional part, it does not round)
df4['Date'] = df4['Date'].astype(int)
# Show the rows that were imputed in this step
df4[missing_dates]

In [37]:
df4[df4['Track']=="Leavin'"]

Artist (index),(unnamed index),Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
Jesse McCartney,235669,spotify:track:20ORwCJusz4KS2PbTPVNKo,Leavin',Jesse McCartney,Departure - Recharged,2003,216880,RnB,0.687,0.71,9.0,...,0.886,158.47,4.0,Throwbacks,52,47,37,1,False,0
Jesse McCartney,235670,spotify:track:20ORwCJusz4KS2PbTPVNKo,Leavin',Jesse McCartney,Departure - Recharged,2005,216880,RnB,0.687,0.71,9.0,...,0.886,158.47,4.0,Feels,94,86,74,1,False,431
Jesse McCartney,235671,spotify:track:20ORwCJusz4KS2PbTPVNKo,Leavin',Jesse McCartney,Departure - Recharged,2006,216880,RnB,0.687,0.71,9.0,...,0.886,158.47,4.0,Throwbacks,56,55,46,1,False,717
Jesse McCartney,235672,spotify:track:20ORwCJusz4KS2PbTPVNKo,Leavin',Jesse McCartney,Departure - Recharged,2005,216880,RnB,0.687,0.71,9.0,...,0.886,158.47,4.0,2000s,79,74,61,1,False,747
Jesse McCartney,235673,spotify:track:20ORwCJusz4KS2PbTPVNKo,Leavin',Jesse McCartney,Departure - Recharged,2004,216880,RnB,0.687,0.71,9.0,...,0.886,158.47,4.0,oldies,107,96,73,2,False,1174
Jesse McCartney,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Jesse McCartney,235930,spotify:track:20ORwCJusz4KS2PbTPVNKo,Leavin',Jesse McCartney,Departure - Recharged,2013,216880,RnB,0.687,0.71,9.0,...,0.886,158.47,4.0,Best,10,9,9,1,False,142189
Jesse McCartney,235931,spotify:track:20ORwCJusz4KS2PbTPVNKo,Leavin',Jesse McCartney,Departure - Recharged,2006,216880,RnB,0.687,0.71,9.0,...,0.886,158.47,4.0,Road Trip,67,60,48,1,False,142306
Jesse McCartney,235932,spotify:track:20ORwCJusz4KS2PbTPVNKo,Leavin',Jesse McCartney,Departure - Recharged,2008,216880,RnB,0.687,0.71,9.0,...,0.886,158.47,4.0,tb,94,80,57,1,False,142538
Jesse McCartney,235933,spotify:track:20ORwCJusz4KS2PbTPVNKo,Leavin',Jesse McCartney,Departure - Recharged,2001,216880,RnB,0.687,0.71,9.0,...,0.886,158.47,4.0,Randoms,59,43,36,1,False,142860


Notice that each song is repeated several times (here, Leavin' appears 266 times) because it appears in several playlists, and each copy received the mean date of its own playlist. We therefore take the mean of these per-playlist dates for each song.
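
As a small illustration of this two-stage averaging, the sketch below uses a made-up frame (`toy`, with invented URIs and years) rather than the real data. It first fills missing dates with the playlist mean, then averages those values per track so that every copy of a song carries the same date. Unlike the notebook, it does not truncate to integers between the two stages.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: track "a" has no date and appears in playlists 1 and 2.
toy = pd.DataFrame({
    "Track URI": ["a", "b", "c", "a", "d", "e"],
    "Pid":       [1,   1,   1,   2,   2,   2],
    "Date":      [np.nan, 2000.0, 2004.0, np.nan, 2010.0, 2012.0],
})

# Stage 1: fill each missing date with the mean date of its playlist.
playlist_mean = toy.groupby("Pid")["Date"].transform("mean")
toy["Date"] = toy["Date"].fillna(playlist_mean)

# Stage 2: average the per-playlist values for each track and broadcast back.
toy["Date"] = toy.groupby("Track URI")["Date"].transform("mean")
print(toy)
```

Here track "a" receives 2002 from playlist 1 and 2011 from playlist 2, so both of its rows end up with the average of the two.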

In [38]:
# 'Artist' is both an index level and a column label, which makes the groupby
# ambiguous, so the index is dropped before grouping
df4.reset_index(drop=True).groupby(['Track', 'Artist'], as_index=False).filter(lambda x: len(x) == 1)

In [None]:
df_mean_date = df4.groupby(['Track URI'])['Date'].mean().reset_index()
df_mean_date

Unnamed: 0,Track URI,Date
0,spotify:track:00Dj0k3r0a6HKTLanwET8L,2009.000000
1,spotify:track:00FROhC5g4iJdax5US8jRr,2000.461538
2,spotify:track:00LfFm08VWeZwB0Zlm24AT,2005.771791
3,spotify:track:00MI0oGDVJYM1qWbyUOIhH,2008.000000
4,spotify:track:00tB8c71eTcG5jV7PhuF4Q,2002.000000
...,...,...
4826,spotify:track:7zFXmv6vqI4qOt4yGf3jYZ,2012.088415
4827,spotify:track:7zP67rufQgoODWFI45jntD,2008.000000
4828,spotify:track:7zQ5nqAKKfk0gtBgV70gyq,2004.000000
4829,spotify:track:7zez4ZwqfSqD6fPQgcnqwu,2013.000000


In [None]:
df4

Unnamed: 0,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
0,spotify:track:6sqNctd7MlJoKDOxPVCAvU,My Happy Ending,Avril Lavigne,Under My Skin,2004,242413,"canadian pop,candy pop,pop",0.414,0.936,2.0,...,0.740,170.229,4.0,Throwbacks,52,47,37,1,False,0
1,spotify:track:3BxWKCI06eQ5Od8TY2JBeA,Buttons,The Pussycat Dolls,PCD,2008,225560,RnB,0.570,0.821,2.0,...,0.408,210.857,4.0,Throwbacks,52,47,37,1,False,0
2,spotify:track:34ceTg8ChN5HjrqiIYCn9Q,Miss Independent,Ne-Yo,Year Of The Gentleman,2008,232000,RnB,0.673,0.683,1.0,...,0.713,171.860,4.0,Throwbacks,52,47,37,1,False,0
3,spotify:track:67T6l4q3zVjC5nZZPXByU8,Whatcha Say,Jason Derulo,Jason Derulo,2013,221253,RnB,0.615,0.711,11.0,...,0.711,144.036,4.0,Throwbacks,52,47,37,1,False,0
4,spotify:track:7H6ev70Weq6DdpZyyTmUXk,Say My Name,Destiny's Child,The Writing's On The Wall,1999,271333,RnB,0.713,0.678,5.0,...,0.734,138.009,4.0,Throwbacks,52,47,37,1,False,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
809171,spotify:track:64GRDrL1efgXclrhVCeuA0,Lay Me Down,Sam Smith,In The Lonely Hour,2014,219536,Emo,0.468,0.190,4.0,...,0.326,125.319,4.0,Feel Good,149,84,62,1,False,142998
809172,spotify:track:4kdfjhj9xNkYU0R8xlDy8k,6 Man,Drake,If You're Reading This It's Too Late,2016,167653,Hiphop,0.568,0.535,4.0,...,0.326,167.178,4.0,Feel Good,149,84,62,1,False,142998
809173,spotify:track:2iUmqdfGZcHIhS3b9E9EWq,Everybody Talks,Neon Trees,Picture Show,2012,177280,"modern rock,neo mellow,pop rock,pov: indie",0.471,0.924,8.0,...,0.725,154.961,4.0,Feel Good,149,84,62,1,False,142998
809174,spotify:track:5InOp6q2vvx0fShv3bzFLZ,Know Yourself,Drake,If You're Reading This It's Too Late,2016,275840,Rap,0.720,0.412,11.0,...,0.179,114.408,4.0,Feel Good,149,84,62,1,False,142998


In [None]:
df_mean_dates = pd.DataFrame(df_mean_date)
# attach the per-track mean date to every row of that track
df_merged = pd.merge(df4, df_mean_dates, on='Track URI', how='left')
# drop the original (per-playlist) date column and keep the per-track mean
df_merged = df_merged.drop(columns=['Date_x'])
df_merged = df_merged.rename(columns={'Date_y':'Date'})
df_merged = df_merged.reindex(columns=['Track URI', 'Track', 'Artist', 'Album', 'Date', 'Duration', 'Genre',
       'Danceability', 'Energy', 'Key', 'Loudness', 'Mode', 'Speechiness',
       'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo',
       'Time Signature', 'Playlist', 'Num_Tracks', 'Num_Albums', 'Num_Artists',
       'Follow', 'Collab', 'Pid'])
df_merged

Unnamed: 0,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
0,spotify:track:6sqNctd7MlJoKDOxPVCAvU,My Happy Ending,Avril Lavigne,Under My Skin,2004.0,242413,"canadian pop,candy pop,pop",0.414,0.936,2.0,...,0.740,170.229,4.0,Throwbacks,52,47,37,1,False,0
1,spotify:track:3BxWKCI06eQ5Od8TY2JBeA,Buttons,The Pussycat Dolls,PCD,2008.0,225560,RnB,0.570,0.821,2.0,...,0.408,210.857,4.0,Throwbacks,52,47,37,1,False,0
2,spotify:track:34ceTg8ChN5HjrqiIYCn9Q,Miss Independent,Ne-Yo,Year Of The Gentleman,2008.0,232000,RnB,0.673,0.683,1.0,...,0.713,171.860,4.0,Throwbacks,52,47,37,1,False,0
3,spotify:track:67T6l4q3zVjC5nZZPXByU8,Whatcha Say,Jason Derulo,Jason Derulo,2013.0,221253,RnB,0.615,0.711,11.0,...,0.711,144.036,4.0,Throwbacks,52,47,37,1,False,0
4,spotify:track:7H6ev70Weq6DdpZyyTmUXk,Say My Name,Destiny's Child,The Writing's On The Wall,1999.0,271333,RnB,0.713,0.678,5.0,...,0.734,138.009,4.0,Throwbacks,52,47,37,1,False,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
809171,spotify:track:64GRDrL1efgXclrhVCeuA0,Lay Me Down,Sam Smith,In The Lonely Hour,2014.0,219536,Emo,0.468,0.190,4.0,...,0.326,125.319,4.0,Feel Good,149,84,62,1,False,142998
809172,spotify:track:4kdfjhj9xNkYU0R8xlDy8k,6 Man,Drake,If You're Reading This It's Too Late,2016.0,167653,Hiphop,0.568,0.535,4.0,...,0.326,167.178,4.0,Feel Good,149,84,62,1,False,142998
809173,spotify:track:2iUmqdfGZcHIhS3b9E9EWq,Everybody Talks,Neon Trees,Picture Show,2012.0,177280,"modern rock,neo mellow,pop rock,pov: indie",0.471,0.924,8.0,...,0.725,154.961,4.0,Feel Good,149,84,62,1,False,142998
809174,spotify:track:5InOp6q2vvx0fShv3bzFLZ,Know Yourself,Drake,If You're Reading This It's Too Late,2016.0,275840,Rap,0.720,0.412,11.0,...,0.179,114.408,4.0,Feel Good,149,84,62,1,False,142998


In [None]:
# checking that the date was modified
df_merged[df_merged['Track URI']=='spotify:track:6sqNctd7MlJoKDOxPVCAvU']

Unnamed: 0,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
0,spotify:track:6sqNctd7MlJoKDOxPVCAvU,My Happy Ending,Avril Lavigne,Under My Skin,2004.0,242413,"canadian pop,candy pop,pop",0.414,0.936,2.0,...,0.74,170.229,4.0,Throwbacks,52,47,37,1,False,0
21,spotify:track:6sqNctd7MlJoKDOxPVCAvU,My Happy Ending,Avril Lavigne,Under My Skin,2004.0,242413,Emo,0.414,0.936,2.0,...,0.74,170.229,4.0,Throwbacks,52,47,37,1,False,0
16617,spotify:track:6sqNctd7MlJoKDOxPVCAvU,My Happy Ending,Avril Lavigne,Under My Skin,2004.0,242413,Emo,0.414,0.936,2.0,...,0.74,170.229,4.0,Dancing on my own,172,69,35,1,True,987
16622,spotify:track:6sqNctd7MlJoKDOxPVCAvU,My Happy Ending,Avril Lavigne,Under My Skin,2004.0,242413,"canadian pop,candy pop,pop",0.414,0.936,2.0,...,0.74,170.229,4.0,Dancing on my own,172,69,35,1,True,987
23249,spotify:track:6sqNctd7MlJoKDOxPVCAvU,My Happy Ending,Avril Lavigne,Under My Skin,2004.0,242413,Emo,0.414,0.936,2.0,...,0.74,170.229,4.0,fall,67,55,46,1,False,1443
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
779012,spotify:track:6sqNctd7MlJoKDOxPVCAvU,My Happy Ending,Avril Lavigne,Under My Skin,2004.0,242413,Emo,0.414,0.936,2.0,...,0.74,170.229,4.0,Sad :(((,70,63,47,1,False,141033
781222,spotify:track:6sqNctd7MlJoKDOxPVCAvU,My Happy Ending,Avril Lavigne,Under My Skin,2004.0,242413,Emo,0.414,0.936,2.0,...,0.74,170.229,4.0,Newer,184,124,73,1,False,141180
781224,spotify:track:6sqNctd7MlJoKDOxPVCAvU,My Happy Ending,Avril Lavigne,Under My Skin,2004.0,242413,"canadian pop,candy pop,pop",0.414,0.936,2.0,...,0.74,170.229,4.0,Newer,184,124,73,1,False,141180
792617,spotify:track:6sqNctd7MlJoKDOxPVCAvU,My Happy Ending,Avril Lavigne,Under My Skin,2004.0,242413,"canadian pop,candy pop,pop",0.414,0.936,2.0,...,0.74,170.229,4.0,2000s jams,70,53,34,1,False,141938


In [None]:
df_merged['Date'] = df_merged['Date'].astype(int)
df4['Date'] = df4['Date'].astype(int)

# df_merged['Date'] = df_merged['Date'].astype(str)
# df4['Date'] = df4['Date'].astype(str)

In [None]:
df_merged[df_merged['Date'] != df4['Date']]

Unnamed: 0,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
6,spotify:track:20ORwCJusz4KS2PbTPVNKo,Leavin',Jesse McCartney,Departure - Recharged,2005,216880,RnB,0.687,0.710,9.0,...,0.8860,158.470,4.0,Throwbacks,52,47,37,1,False,0
8,spotify:track:7DFnq8FYhHMCylykf6ZCxA,Yo (Excuse Me Miss),Chris Brown,Chris Brown,2005,229040,RnB,0.536,0.612,4.0,...,0.5700,173.536,4.0,Throwbacks,52,47,37,1,False,0
9,spotify:track:7k6IzwMGpxnRghE7YosnXT,Me & U,Cassie,Cassie,2005,192213,RnB,0.803,0.454,8.0,...,0.7390,199.980,4.0,Throwbacks,52,47,37,1,False,0
10,spotify:track:7k6IzwMGpxnRghE7YosnXT,Me & U,Cassie,Cassie,2005,192213,Hiphop,0.803,0.454,8.0,...,0.7390,199.980,4.0,Throwbacks,52,47,37,1,False,0
11,spotify:track:1TfAhjzRBWzYZ8IdUV3igl,Year 3000,Jonas Brothers,Jonas Brothers,2006,201960,Emo,0.659,0.869,11.0,...,0.8110,106.966,4.0,Throwbacks,52,47,37,1,False,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
809139,spotify:track:7m9OqQk4RVRkw9JJdeAw96,Jocelyn Flores,XXXTENTACION,17,2013,119133,Underground Rap,0.872,0.391,0.0,...,0.4370,134.021,4.0,getting ready,27,27,21,1,False,142997
809141,spotify:track:5tz69p7tJuGPeMGwNTxYuV,1-800-273-8255,Logic,Everybody,2013,250173,Emo,0.620,0.574,5.0,...,0.3570,200.046,4.0,getting ready,27,27,21,1,False,142997
809144,spotify:track:3bidbhpOYeV4knp8AIu8Xn,Can't Hold Us - feat. Ray Dalton,Macklemore & Ryan Lewis,The Heist,2007,258343,Rap,0.641,0.922,2.0,...,0.8470,146.078,4.0,Feel Good,149,84,62,1,False,142998
809168,spotify:track:5TvE3pk05pyFIGdSY9j4DJ,Say Something,A Great Big World,Is There Anybody Out There? - Track by Track C...,2008,229400,Emo,0.407,0.147,2.0,...,0.0765,141.284,3.0,Feel Good,149,84,62,1,False,142998


In [None]:
df4[df4['Track URI']=='spotify:track:3bidbhpOYeV4knp8AIu8Xn']

Unnamed: 0,Track URI,Track,Artist,Album,Date,Duration,Genre,Danceability,Energy,Key,...,Valence,Tempo,Time Signature,Playlist,Num_Tracks,Num_Albums,Num_Artists,Follow,Collab,Pid
91,spotify:track:3bidbhpOYeV4knp8AIu8Xn,Can't Hold Us - feat. Ray Dalton,Macklemore & Ryan Lewis,The Heist,2003,258343,Rap,0.641,0.922,2.0,...,0.847,146.078,4.0,Wedding,80,71,56,1,False,5
589,spotify:track:3bidbhpOYeV4knp8AIu8Xn,Can't Hold Us - feat. Ray Dalton,Macklemore & Ryan Lewis,The Heist,2010,258343,Rap,0.641,0.922,2.0,...,0.847,146.078,4.0,Road Trippin',158,131,106,2,False,42
1097,spotify:track:3bidbhpOYeV4knp8AIu8Xn,Can't Hold Us - feat. Ray Dalton,Macklemore & Ryan Lewis,The Heist,2013,258343,Rap,0.641,0.922,2.0,...,0.847,146.078,4.0,Gym,72,63,45,2,False,85
2913,spotify:track:3bidbhpOYeV4knp8AIu8Xn,Can't Hold Us - feat. Ray Dalton,Macklemore & Ryan Lewis,The Heist,2015,258343,Rap,0.641,0.922,2.0,...,0.847,146.078,4.0,Alexia,41,25,21,1,False,186
3180,spotify:track:3bidbhpOYeV4knp8AIu8Xn,Can't Hold Us - feat. Ray Dalton,Macklemore & Ryan Lewis,The Heist,2015,258343,Rap,0.641,0.922,2.0,...,0.847,146.078,4.0,Bus playlist,52,41,34,2,False,209
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
805191,spotify:track:3bidbhpOYeV4knp8AIu8Xn,Can't Hold Us - feat. Ray Dalton,Macklemore & Ryan Lewis,The Heist,2009,258343,Rap,0.641,0.922,2.0,...,0.847,146.078,4.0,randomness,162,70,80,1,False,142740
805958,spotify:track:3bidbhpOYeV4knp8AIu8Xn,Can't Hold Us - feat. Ray Dalton,Macklemore & Ryan Lewis,The Heist,2013,258343,Rap,0.641,0.922,2.0,...,0.847,146.078,4.0,Summerrrr,77,70,66,1,False,142786
806848,spotify:track:3bidbhpOYeV4knp8AIu8Xn,Can't Hold Us - feat. Ray Dalton,Macklemore & Ryan Lewis,The Heist,2006,258343,Rap,0.641,0.922,2.0,...,0.847,146.078,4.0,2000s.,139,115,75,17,False,142830
808608,spotify:track:3bidbhpOYeV4knp8AIu8Xn,Can't Hold Us - feat. Ray Dalton,Macklemore & Ryan Lewis,The Heist,2014,258343,Rap,0.641,0.922,2.0,...,0.847,146.078,4.0,32,20,20,19,4,False,142975


In [None]:
df_merged[df_merged['Track URI']=='spotify:track:3bidbhpOYeV4knp8AIu8Xn'].iloc[0,:]

Track URI           spotify:track:3bidbhpOYeV4knp8AIu8Xn
Track                   Can't Hold Us - feat. Ray Dalton
Artist                           Macklemore & Ryan Lewis
Album                                          The Heist
Date                                                2007
Duration                                          258343
Genre                                                Rap
Danceability                                       0.641
Energy                                             0.922
Key                                                  2.0
Loudness                                          -4.457
Mode                                                 1.0
Speechiness                                       0.0786
Acousticness                                      0.0291
Instrumentalness                                     0.0
Liveness                                          0.0862
Valence                                            0.847
Tempo                          

In [None]:
df_merged[df_merged['Track']=="wokeuplikethis*"].iloc[0,:]

Track URI           spotify:track:59J5nzL1KniFHnU120dQzt
Track                                    wokeuplikethis*
Artist                                     Playboi Carti
Album                                      Playboi Carti
Date                                                2014
Duration                                          235535
Genre                                                RnB
Danceability                                       0.785
Energy                                              0.62
Key                                                  8.0
Loudness                                          -6.667
Mode                                                 1.0
Speechiness                                        0.254
Acousticness                                      0.0138
Instrumentalness                                     0.0
Liveness                                            0.15
Valence                                            0.478
Tempo                          

Notice that 'Can't Hold Us' appears 870 times. Its mean date is 2007, while its actual date is 2012. In the previous imputation (on the unreplicated data), the imputed date was 2001, so this is a clear improvement. Similarly, 'wokeuplikethis*' has an imputed date of 2014, which is close to the actual date of 2017. These examples suggest that the imputation gives a reasonable approximation of the actual dates.

One way to check the imputation would be to delete the dates that we know are correct (the ones we had from the beginning), impute them, and compare the imputed values with the originals.
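
A minimal sketch of such a check, assuming a frame with `Artist` and `Date` columns and a plain unique index (for example after `reset_index(drop=True)`): hide a random share of the known dates, re-impute them, and measure the absolute error against the hidden originals. The helper name `holdout_error`, the 20% hold-out fraction and the artist-mean imputer are illustrative choices, not the notebook's own code; the same pattern could be applied to the album and playlist steps.

```python
import numpy as np
import pandas as pd

def holdout_error(df, group_col="Artist", frac=0.2, seed=0):
    """Mask a share of known dates, re-impute by group mean, return abs errors."""
    known = df.index[df["Date"].notna()]
    rng = np.random.default_rng(seed)
    n_hidden = max(1, int(len(known) * frac))
    hidden = rng.choice(known, size=n_hidden, replace=False)

    masked = df.copy()
    masked.loc[hidden, "Date"] = np.nan
    # Group means are computed without the hidden values.
    imputed = masked.groupby(group_col)["Date"].transform("mean")

    errors = (imputed.loc[hidden] - df.loc[hidden, "Date"]).abs()
    # Groups left with no known date cannot be scored and are dropped.
    return errors.dropna()

# Made-up example data, only to show the call.
demo = pd.DataFrame({
    "Artist": ["A"] * 5 + ["B"] * 5,
    "Date": [2000.0, 2002.0, 2004.0, 2006.0, 2008.0,
             2010.0, 2010.0, 2010.0, 2010.0, 2010.0],
})
errors = holdout_error(demo)
print("mean abs error:", errors.mean())
print("share within 10 years:", (errors < 10).mean())
```

The share of errors below 10 years corresponds directly to the target stated at the start of the notebook.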

## Final dataset with all the imputed dates for all the playlists

The code that removes replicated data can then be applied to obtain the dataset with only unreplicated values.

In [None]:
df_merged.to_csv("playlists_with_dates.csv", index=False)
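
For reference, one plausible way to reach an unreplicated table is to keep a single row per track. The sketch below assumes that `Track URI` identifies a song and that any one of its rows is representative; the helper `one_row_per_track` and the toy frame are hypothetical, and the project's actual deduplication code may differ.

```python
import pandas as pd

def one_row_per_track(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first row seen for each track URI."""
    return df.drop_duplicates(subset="Track URI", keep="first").reset_index(drop=True)

# Tiny made-up example: track "a" appears in two playlists.
demo = pd.DataFrame({
    "Track URI": ["a", "a", "b"],
    "Date": [2004, 2004, 2013],
    "Pid": [0, 987, 0],
})
unique_tracks = one_row_per_track(demo)
print(unique_tracks)
```

Because the per-track mean has already been written to every copy of a song, the kept row carries the same date whichever copy is retained; playlist-level columns such as `Pid` are no longer meaningful after this step.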