# Cleaning

## Imports

In [14]:
import pandas as pd
import json
import sys
import os

sys.path.append(os.path.abspath('../src'))

import api.spotify.search as searcher

In [16]:
info = searcher.search_isrc('QZ7XS2400012')
if info:
    print(json.dumps(info, indent=2))

2025-01-31 23:43:33,049 - INFO - Refreshing Spotify access token...
2025-01-31 23:43:33,128 - INFO - Access token refreshed successfully!


{
  "Track": "CARNIVAL",
  "Album Name": "VULTURES 1",
  "Artist": "\u00a5$"
}


In [39]:
base = pd.read_csv('./data/Most_Streamed_Spotify_Songs_2024_utf8.csv')

## Corrupted Characters

The dataset's creator encoded the CSV incorrectly before uploading it, which left many values with corrupted characters. For example:

1. `Track` on Line **89**: Titï¿½ï¿½ Me Pregu
2. `Album Name` on Line **28**: ýýýýýýýýý ýýýýýý ýýýýýýýýýýýý
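
The first example hints at why re-decoding the file is unlikely to help. As a small sketch, assuming the characters are stored in the CSV exactly as displayed above, the sequence `ï¿½` is what the three UTF-8 bytes of U+FFFD (the Unicode replacement character) look like when read as Latin-1. Reversing that misread yields only replacement characters, which suggests the original letters were discarded before the file was saved. The helper name `undo_latin1_misread` is illustrative and not part of the project.

```python
# Sample copied from the corrupted `Track` value shown above (assumed to be
# stored in the CSV exactly as displayed).
sample = "Titï¿½ï¿½ Me Pregu"


def undo_latin1_misread(text: str) -> str:
    """Reverse a UTF-8-read-as-Latin-1 mix-up; return the text unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


repaired = undo_latin1_misread(sample)

# Each "ï¿½" collapses into a single U+FFFD, so the original letters are gone.
print(repaired.count("\ufffd"))  # 2
```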

After trying several decoding methods without success, I turned to Spotify's API to repair the text. Every track in the dataset has an ISRC (International Standard Recording Code), and Spotify returns metadata by ISRC for most tracks. Matching on ISRC lets me retrieve the correct metadata and replace the corrupted values with properly encoded UTF-8 text.
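
The replacement step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual implementation: `looks_corrupted` and `repair_record` are hypothetical helpers, the corruption markers are a heuristic guess based on the two examples shown, and `lookup` stands in for any function that maps an ISRC to a dict with `Track`, `Album Name`, and `Artist` keys (or a falsy value when nothing is found).

```python
from typing import Callable, Optional

FIELDS = ("Track", "Album Name", "Artist")


def looks_corrupted(text: str) -> bool:
    """Heuristic (assumption): U+FFFD, its mojibake form, or a run of 'ý'."""
    return any(marker in text for marker in ("\ufffd", "ï¿½", "ýý"))


def repair_record(record: dict, lookup: Callable[[str], Optional[dict]]) -> dict:
    """Return a copy of `record` with corrupted fields replaced from `lookup(isrc)`."""
    fixed = dict(record)
    if not any(looks_corrupted(str(record[f])) for f in FIELDS):
        return fixed  # nothing to fix, so skip the lookup entirely
    info = lookup(record["ISRC"])
    if not info:
        return fixed  # no match for this ISRC: leave the row untouched
    for field in FIELDS:
        if looks_corrupted(str(record[field])) and info.get(field):
            fixed[field] = info[field]
    return fixed


# Stub lookup returning the metadata printed earlier for this ISRC.
def stub_lookup(isrc: str) -> Optional[dict]:
    known = {
        "QZ7XS2400012": {"Track": "CARNIVAL", "Album Name": "VULTURES 1", "Artist": "\u00a5$"}
    }
    return known.get(isrc)


# Made-up corrupted row, for illustration only.
row = {"Track": "CARNIVAL", "Album Name": "VULTURES 1", "Artist": "ýý", "ISRC": "QZ7XS2400012"}
fixed = repair_record(row, stub_lookup)
print(fixed["Artist"] == "\u00a5$")  # True
```

In the notebook, `lookup` would presumably be `searcher.search_isrc`, since the earlier cell shows it returning a dict with these three keys. Only fields that look corrupted are overwritten, so clean values keep the dataset's original spelling.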

In [42]:
base_corrupt = base[['Track', 'Album Name', 'Artist', 'ISRC']]

In [43]:
base_corrupt.head(10)

| | Track | Album Name | Artist | ISRC |
|---|---|---|---|---|
| 0 | MILLION DOLLAR BABY | Million Dollar Baby - Single | Tommy Richman | QM24S2402528 |
| 1 | Not Like Us | Not Like Us | Kendrick Lamar | USUG12400910 |
| 2 | i like the way you kiss me | I like the way you kiss me | Artemas | QZJ842400387 |
| 3 | Flowers | Flowers - Single | Miley Cyrus | USSM12209777 |
| 4 | Houdini | Houdini | Eminem | USUG12403398 |
| 5 | Lovin On Me | Lovin On Me | Jack Harlow | USAT22311371 |
| 6 | Beautiful Things | Beautiful Things | Benson Boone | USWB12307016 |
| 7 | Gata Only | Gata Only | FloyyMenor | QZL382406049 |
| 8 | Danza Kuduro - Cover | ýýýýýýýýýýýýýýýýýýýýý - ýýýýýýýýýýýýýýýýýý - | MUSIC LAB JPN | TCJPA2463708 |
| 9 | BAND4BAND (feat. Lil Baby) | BAND4BAND (feat. Lil Baby) | Central Cee | USSM12404354 |
