# Handling duplicates 

**Learning Objectives**

This tutorial explains how to handle duplicates using different strategies:
* Removing exact duplicates
* Detecting near-duplicates using fuzzy matching
* Detecting near-duplicates using cosine similarity (for textual data)


In [1]:
import pandas as pd
df = pd.read_csv("../data/spotify_track_dataset.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


In [3]:
df.columns

Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',
       'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'track_genre'],
      dtype='object')

In [4]:
# Detecting exact duplicates in track name and artist
duplicates = df[df.duplicated(subset=['track_name', 'artists'])]
print("Exact Duplicates:\n", duplicates)

# Dropping duplicates
df_cleaned = df.drop_duplicates(subset=['track_name', 'artists'])
print("\nCleaned DataFrame (no duplicates):\n", df_cleaned.head())

Exact Duplicates:
         Unnamed: 0                track_id  \
18              18  2qLMf6TuEC3ruGJg4SMMN6   
20              20  3S0OXQeoh0w6AY8WQVckRW   
22              22  5TvE3pk05pyFIGdSY9j4DJ   
28              28  5QAMZTM5cmLg3fHX9ZbTZi   
29              29  2qESE1ZeWly7I3YjyTXmXh   
...            ...                     ...   
113845      113845  5oyYmgnwGZ74992OLfYD2f   
113882      113882  7lYdF3SC4SCJPg5kROvXWx   
113917      113917  4r0ETFFJMBSQ0Z3ntuMDP2   
113951      113951  54o7m2sWPTagySKiaPPpT2   
113991      113991  0CE0Y6GM75cbrqao8EOAlW   

                                     artists  \
18                 Jason Mraz;Colbie Caillat   
20                                Jason Mraz   
22      A Great Big World;Christina Aguilera   
28                                Jason Mraz   
29                                Jason Mraz   
...                                      ...   
113845    Hillsong Worship;Brooke Ligertwood   
113882                 Bryan & Katie Torwalt
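
The cell above keeps the first occurrence of each `(track_name, artists)` pair and drops the rest. As a minimal sketch of two common variations, the example below uses a small hypothetical DataFrame (`toy`, not the Spotify data) to show how `keep=False` lists every copy of a duplicated row for review, and how sorting first controls which copy survives. Keeping the most popular copy is an assumption made for illustration, not a rule taken from this tutorial.

```python
import pandas as pd

# Hypothetical toy data standing in for the Spotify dataset
toy = pd.DataFrame({
    "track_name": ["Hold On", "Hold On", "Comedy", "Hold On"],
    "artists": ["Chord Overstreet", "Chord Overstreet", "Gen Hoshino", "Another Artist"],
    "popularity": [82, 80, 73, 10],
})

subset = ["track_name", "artists"]

# keep='first' (the default) flags only the repeats, not the first occurrence
repeats = toy[toy.duplicated(subset=subset)]

# keep=False flags every member of a duplicate group, which helps manual review
all_copies = toy[toy.duplicated(subset=subset, keep=False)]

# Sort by popularity first so drop_duplicates keeps the most popular copy
deduped = (
    toy.sort_values("popularity", ascending=False)
       .drop_duplicates(subset=subset)
       .sort_index()
)

print(repeats)
print(all_copies)
print(deduped)
```

Note that the same track name by a different artist is not treated as a duplicate, because both columns are part of the subset.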

In [5]:
from fuzzywuzzy import fuzz, process

# Example: find the track names closest to 'Shape of You'.
# token_sort_ratio lowercases both strings and sorts their words before scoring (0-100).
track_to_check = 'Shape of You'
similar_tracks = process.extract(track_to_check, df['track_name'].tolist(), scorer=fuzz.token_sort_ratio, limit=5)

print(f"\nNear-Duplicates for '{track_to_check}':\n", similar_tracks)





Near-Duplicates for 'Shape of You':
 [('Shape Of You', 100), ('Shape of You', 100), ('Shape of You - Rock', 83), ('Spit Of You', 78), ('Spit Of You', 78)]
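
In the output above, 'Shape Of You' scores 100 even though its capitalization differs from the query, because the scorer normalizes the strings before comparing them. The sketch below approximates that idea with the standard library only. `token_sort_score` is a hypothetical helper, not part of fuzzywuzzy, and its scores may differ from the library's depending on the backend fuzzywuzzy uses.

```python
import re
from difflib import SequenceMatcher


def token_sort_score(a: str, b: str) -> int:
    """Approximate token-sort similarity on a 0-100 scale (illustrative only)."""

    def normalize(s: str) -> str:
        # Lowercase, replace punctuation with spaces, then sort the words
        tokens = re.sub(r"[^\w\s]", " ", s.lower()).split()
        return " ".join(sorted(tokens))

    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(100 * ratio)


query = "Shape of You"
for candidate in ["Shape Of You", "You Shape of", "Shape of You - Rock", "Comedy"]:
    print(candidate, token_sort_score(query, candidate))
```

Because the words are sorted before comparison, reordered titles score as identical, which is useful for names like "Artist - Title" versus "Title - Artist" but can also hide meaningful differences.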


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TfidfVectorizer cannot handle missing values, so drop rows without a track name
df = df.dropna(subset=['track_name'])

# Build TF-IDF vectors from the 'track_name' column, ignoring English stop words
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df['track_name'])

# Compute cosine similarity matrix for the first 10 songs
cosine_sim = cosine_similarity(tfidf_matrix[:10])

print("\nCosine Similarity Matrix (First 10 Songs):\n", cosine_sim)



Cosine Similarity Matrix (First 10 Songs):
 [[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
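
Row 8 of the matrix above is all zeros, including the diagonal. One plausible explanation (an assumption, since the cell does not print the track names) is that every word in that track name was discarded as a stop word or by the tokenizer, leaving an all-zero vector whose cosine similarity is 0 with everything, itself included. The sketch below reproduces the effect with plain word counts instead of TF-IDF and a tiny hypothetical stop list. `STOP_WORDS`, `vectorize`, and `cosine` are illustrative names, not scikit-learn APIs.

```python
import math
from collections import Counter

# Tiny illustrative stop list; scikit-learn's English list is much longer
STOP_WORDS = {"i", "m", "the", "of", "you", "yours"}


def vectorize(text: str) -> Counter:
    """Count the words that survive stop-word removal."""
    return Counter(t for t in text.lower().split() if t not in STOP_WORDS)


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity of two count vectors; 0.0 if either vector is empty."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


kept = vectorize("Hold On")
emptied = vectorize("The Yours")  # every word is in STOP_WORDS

print(cosine(kept, kept))        # close to 1: a non-empty vector matches itself
print(cosine(emptied, emptied))  # 0.0: nothing is left to compare
```

Short names made up only of common words therefore cannot be matched by this approach, so it may be worth checking for empty vectors before relying on the similarity scores.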


In [14]:
threshold = 0.8  # Minimum cosine similarity for two names to count as near-duplicates
near_duplicate_pairs = []

# cosine_sim only covers the first 10 songs, so only those pairs are checked.
# Visit each pair once (j > i) and collect those at or above the threshold.
for i in range(len(cosine_sim)):
    for j in range(i + 1, len(cosine_sim)):
        if cosine_sim[i, j] >= threshold:
            near_duplicate_pairs.append((df['track_name'].iloc[i], df['track_name'].iloc[j], cosine_sim[i, j]))

print("\nNear-Duplicate Track Pairs based on Cosine Similarity:\n", near_duplicate_pairs)

# Conclusion
print("\nTutorial Completed: We detected exact and near-duplicates using both fuzzy matching and cosine similarity.")


Near-Duplicate Track Pairs based on Cosine Similarity:
 []

Tutorial Completed: We detected exact and near-duplicates using both fuzzy matching and cosine similarity.
