https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset

In [116]:
import sys
import os
import pandas as pd
import numpy as np

sys.path.append(os.path.abspath('../'))

## Data load

In [117]:
# Path to the Spotify dataset in the project directory
csv_file = '../data/external/spotify_dataset.csv'

df = pd.read_csv(csv_file)

df.head()

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


## Data Overview and Descriptive Statistics

### Overview

The number of observations and features is obtained through pandas' `.shape` attribute. The "Spotify" dataset contains **114,000 observations (rows)** and **21 features (columns)**.

In [118]:
df.shape

(114000, 21)

The data types are obtained through pandas' `.dtypes` attribute. The DataFrame contains 1 boolean feature, 5 object features, 6 int64 features and 9 float64 features.


In [119]:
def columns_per_dtype(df):
    """
    Print the columns of a DataFrame grouped by data type (dtype) and
    return the grouping as a dictionary.

    Parameters:
        df (pd.DataFrame): The DataFrame to analyze.

    Returns:
        dict: A dictionary mapping each dtype to a list of column names with that dtype.
    """
    result = {}
    for col in df.columns:
        # Append the column to the list for its dtype, creating the list if needed
        result.setdefault(df[col].dtype, []).append(col)

    print("Columns grouped by data types:")
    for dtype, columns in result.items():
        print(f"\nData Type: {dtype}")
        print("Columns:")
        for col in columns:
            print(f"  - {col}")

    return result


# Assign the returned dictionary so the cell shows only the printed summary
_ = columns_per_dtype(df)

Columns grouped by data types:

Data Type: int64
Columns:
  - Unnamed: 0
  - popularity
  - duration_ms
  - key
  - mode
  - time_signature

Data Type: object
Columns:
  - track_id
  - artists
  - album_name
  - track_name
  - track_genre

Data Type: bool
Columns:
  - explicit

Data Type: float64
Columns:
  - danceability
  - energy
  - loudness
  - speechiness
  - acousticness
  - instrumentalness
  - liveness
  - valence
  - tempo
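
The same per-dtype summary can also be cross-checked with built-in pandas calls. The following is a minimal sketch on a small made-up frame (the `toy` DataFrame and its values are illustrative assumptions, not rows from the Spotify dataset):

```python
import pandas as pd

# Toy frame standing in for the Spotify DataFrame (values are made up)
toy = pd.DataFrame({
    "track_id": ["a", "b", "c"],
    "popularity": [10, 20, 30],
    "explicit": [False, True, False],
    "energy": [0.1, 0.5, 0.9],
    "tempo": [90.0, 120.0, 150.0],
})

# Number of columns per dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Column names belonging to one dtype family
float_cols = toy.select_dtypes(include="float64").columns.tolist()
print(float_cols)
```

`select_dtypes` is convenient later on when numeric and categorical columns need to be processed separately.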


The `Unnamed: 0` column is dropped because it is not part of the original dataset.

In [120]:
df = df.drop(columns=["Unnamed: 0"])

Duplicated rows are identified with pandas' `.duplicated` method. The DataFrame has 450 duplicate rows.

In [121]:
df[df.duplicated()].shape[0]

450
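
How `.duplicated` counts rows depends on its `keep` parameter. The sketch below uses a small made-up frame (the `toy` DataFrame is an illustrative assumption) to show the difference between the default `keep="first"`, which flags only the repeats, and `keep=False`, which flags every member of a duplicated group:

```python
import pandas as pd

# Toy frame with one pair and one triple sharing a track_id (values are made up)
toy = pd.DataFrame({
    "track_id": ["a", "a", "b", "c", "c", "c"],
    "track_genre": ["rock", "rock", "pop", "jazz", "jazz", "blues"],
})

# Default keep="first": only rows that repeat an earlier, fully identical row
n_extra = int(toy.duplicated().sum())

# keep=False: every row that has at least one fully identical twin
n_all = int(toy.duplicated(keep=False).sum())

print(n_extra, n_all)
```

With the default setting, the count equals the number of rows that `drop_duplicates()` would remove.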

The missing values per feature are obtained through pandas' `.isnull().sum()` chain. Only the features "artists", "album_name" and "track_name" have missing values, one for each feature. These features, as indicated by their object data type, are qualitative variables.

In [122]:
df.isnull().sum()

track_id            0
artists             1
album_name          1
track_name          1
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64

Rows with null values are filtered using the `.isnull()` method combined with `.any(axis=1)`. All three missing values fall in a single row.

In [123]:
filtered_rows = df[df.isnull().any(axis=1)]

filtered_rows

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
65900,1kR4gIb7nGxHPI3D2ifs59,,,,0,0,False,0.501,0.583,7,-9.46,0,0.0605,0.69,0.00396,0.0747,0.734,138.391,4,k-pop


The percentage of missing cells is approximately 0.0001%, which is negligible.

In [124]:
round(df.isnull().sum().sum() / df.size * 100, 4) 

np.float64(0.0001)
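
The figure above is a share of all cells (3 missing cells out of 114,000 × 20 = 2,280,000). Measured per row instead, one affected row out of 114,000 is about 0.0009%. A minimal sketch of the three common ways to express missingness, on a made-up frame (the `toy` DataFrame is an illustrative assumption):

```python
import pandas as pd

# Toy frame with a single missing cell (values are made up)
toy = pd.DataFrame({
    "track_name": ["x", None, "z", "w"],
    "popularity": [1, 2, 3, 4],
})

# Share of missing cells over the whole frame
cell_pct = toy.isnull().sum().sum() / toy.size * 100

# Share of missing values per column
col_pct = toy.isnull().mean() * 100

# Share of rows with at least one missing value
row_pct = toy.isnull().any(axis=1).mean() * 100

print(cell_pct, row_pct)
print(col_pct)
```

The per-row share is usually the relevant one when the plan is to drop incomplete rows.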

### Descriptive statistics

#### Quantitative variables

Descriptive statistics of the quantitative variables are generated through pandas' `.describe` method.



In [125]:
df.describe()

Unnamed: 0,popularity,duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0
mean,33.238535,228029.2,0.5668,0.641383,5.30914,-8.25896,0.637553,0.084652,0.31491,0.15605,0.213553,0.474068,122.147837,3.904035
std,22.305078,107297.7,0.173542,0.251529,3.559987,5.029337,0.480709,0.105732,0.332523,0.309555,0.190378,0.259261,29.978197,0.432621
min,0.0,0.0,0.0,0.0,0.0,-49.531,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,17.0,174066.0,0.456,0.472,2.0,-10.013,0.0,0.0359,0.0169,0.0,0.098,0.26,99.21875,4.0
50%,35.0,212906.0,0.58,0.685,5.0,-7.004,1.0,0.0489,0.169,4.2e-05,0.132,0.464,122.017,4.0
75%,50.0,261506.0,0.695,0.854,8.0,-5.003,1.0,0.0845,0.598,0.049,0.273,0.683,140.071,4.0
max,100.0,5237295.0,0.985,1.0,11.0,4.532,1.0,0.965,0.996,1.0,1.0,0.995,243.372,5.0


#### Qualitative variables

Pandas' `.describe` method is used with the parameter `include='object'` to describe all qualitative columns of the DataFrame.

In [126]:
df.describe(include='object') 

Unnamed: 0,track_id,artists,album_name,track_name,track_genre
count,114000,113999,113999,113999,114000
unique,89741,31437,46589,73608,114
top,6S3JlDAGk3uu3NtZbPnuhS,The Beatles,Alternative Christmas 2022,Run Rudolph Run,acoustic
freq,9,279,195,151,1000


1. **Observation Count**: The `count` of `track_id` and `track_genre` matches the total number of observations (114,000), which indicates every row has an entry in these columns. `artists`, `album_name` and `track_name` each have a count of 113,999 because of the single missing value.

2. **Uniqueness**: Despite having 114,000 observations, `track_id` contains only 89,741 unique values. This means that some `track_id`s are repeated across multiple rows; the most frequent one appears 9 times.

3. **track_genre**: This column has only 114 unique values, so it is a categorical variable. Since the highest frequency is 1,000 and 114 × 1,000 = 114,000, every genre appears exactly 1,000 times.

4. **Frequent Entries**: The `top` values show the most frequent entry for each column, and `freq` gives the count of that value. "The Beatles" is the mode in `artists`, "Alternative Christmas 2022" is the mode in `album_name`, and "Run Rudolph Run" is the mode in `track_name`. "acoustic" is reported as the mode in `track_genre`, but because all genres are tied it is simply the one that `.describe` happens to report.
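
When several values share the highest frequency, `top` shows only one of them. A minimal sketch for checking this with `value_counts` and `mode`, on a made-up Series (the `genres` values are illustrative assumptions):

```python
import pandas as pd

# Toy genre column where every category appears the same number of times
genres = pd.Series(
    ["acoustic", "rock", "acoustic", "rock", "pop", "pop"],
    name="track_genre",
)

# Frequency of each category
counts = genres.value_counts()

# True when all categories have the same frequency (a balanced column)
is_balanced = counts.nunique() == 1

# mode() returns every value tied for the highest frequency
modes = genres.mode().tolist()

print(is_balanced, modes)
```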

## Handling missing values

As established previously, only one row contains missing values, amounting to approximately 0.0001% of the data, which is negligible. This row is therefore dropped.

In [127]:
df = df.dropna()

The DataFrame now has 113,999 rows and 20 columns.

In [128]:
df.shape

(113999, 20)

## Handling duplicated values

As stated previously, the DataFrame has 450 duplicate rows. These are dropped.

In [129]:
df = df.drop_duplicates()

In [130]:
print(f"The Dataframe without duplicates has {df.shape[0]} rows and {df.shape[1]} columns")

The Dataframe without duplicates has 113549 rows and 20 columns


### Inspecting the duplicated entries in `track_id`

We create a DataFrame containing only the rows whose `track_id` appears more than once and check its dimensions to confirm the number of rows involved.

In [131]:
duplicates_id = df[df.duplicated(subset=['track_id'], keep=False)]
print(f"The Dataframe with duplicates has {duplicates_id.shape[0]} rows.")

The Dataframe with duplicates has 40108 rows.


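Now we compute the difference in the number of rows between the current DataFrame (`df`, with the missing row and the exact duplicates already removed) and the filtered DataFrame of duplicates (`duplicates_id`). The result is the number of rows whose `track_id` appears only once, and the percentage is relative to the 113,549 rows of `df` at this point.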

In [132]:
difference = (df.shape[0] - duplicates_id.shape[0])

# Compute the percentage of non-duplicated rows
percentage_non_duplicated = (difference / df.shape[0]) * 100

# Print the result
print(f"The number of non-duplicated rows is {difference}, which is {percentage_non_duplicated:.2f}% of the original Spotify DataFrame.")

The number of non-duplicated rows is 73441, which is 64.68% of the original Spotify DataFrame.
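
Note that the 40,108 figure counts every row belonging to a repeated `track_id` group, not the number of rows that de-duplicating by `track_id` would remove. A minimal sketch of the distinction, on a made-up frame (the `toy` DataFrame is an illustrative assumption):

```python
import pandas as pd

# Toy frame: "a" appears twice, "c" three times, "b" and "d" once (values are made up)
toy = pd.DataFrame({
    "track_id": ["a", "a", "b", "c", "c", "c", "d"],
    "track_genre": ["rock", "pop", "pop", "jazz", "blues", "soul", "folk"],
})

# Every row whose track_id occurs more than once
in_groups = toy.duplicated(subset=["track_id"], keep=False)
n_rows_in_groups = int(in_groups.sum())

# Number of distinct track_ids that are repeated
n_repeated_ids = toy.loc[in_groups, "track_id"].nunique()

# Rows that would be removed when keeping one row per track_id
n_removed = int(toy.duplicated(subset=["track_id"]).sum())

print(n_rows_in_groups, n_repeated_ids, n_removed)
```

The rows in repeated groups always equal the repeated ids plus the rows that would be removed.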


We need to check whether rows that share a `track_id` also share a `track_name`. To do so, the data is grouped by `track_id` and we check that each group has only one unique `track_name`.

1. **`groupby('track_id')`**: Groups the DataFrame by `track_id`.
2. **`nunique()`**: Counts the number of unique `track_name` values in each group.
3. **Check for inconsistencies**: Identifies `track_id`s where there is more than one unique `track_name`.

In [133]:
grouped = df.groupby('track_id')['track_name'].nunique()

# Check for track_ids with more than one unique track_name
inconsistent = grouped[grouped > 1]

if inconsistent.empty:
    print("All duplicated track_ids have the same track_name.")
else:
    print("Some duplicated track_ids have inconsistent track_names.")
    print(inconsistent)


All duplicated track_ids have the same track_name.


As all duplicated `track_id`s have the same `track_name`, we need to further inspect what differentiates these duplicate tracks.

In [134]:
# Group by 'track_id' and check for identical rows within each group
identical_groups = duplicates_id.groupby('track_id').filter(
    lambda group: group.drop_duplicates().shape[0] == 1
)

if identical_groups.empty:
    print("No duplicates are fully identical across all fields.")
else:
    print("Fully identical rows:")
    print(identical_groups.shape[0])

No duplicates are fully identical across all fields.


In [135]:
def check_inconsistencies(df, id_col='track_id'):
    """Version with diagnostic messages"""
    dup_mask = df.duplicated(subset=id_col, keep=False)
    print(f"Total duplicated records: {dup_mask.sum()}")

    if not dup_mask.any():
        print("No duplicates to analyze")
        return pd.DataFrame(columns=[id_col, 'inconsistent_columns'])

    results = []
    for track_id in df.loc[dup_mask, id_col].unique():
        group = df[df[id_col] == track_id]
        # Columns (other than the id) that take more than one value within the group
        inconsistent = [col for col in group.columns
                        if col != id_col and group[col].nunique() > 1]
        if inconsistent:
            results.append({
                id_col: track_id,
                'inconsistent_columns': ', '.join(inconsistent),
                'n_duplicates': len(group),
                'example_values': str(group[inconsistent[0]].unique()[:3])  # Show the first few values
            })

    if not results:
        print("Duplicates found, but they are fully consistent across all columns")

    return pd.DataFrame(results)

In [136]:
inconsistencies = check_inconsistencies(duplicates_id)
if inconsistencies.empty:
    print("No inconsistencies found in duplicates")
else:
    print("Inconsistencies found:")
    display(inconsistencies)

Total duplicated records: 40108


Inconsistencies found:


Unnamed: 0,track_id,inconsistent_columns,n_duplicates,example_values
0,5SuOikwiRyPMVoIQDJUgSV,track_genre,4,['acoustic' 'j-pop' 'singer-songwriter']
1,4qPNDBW1i3p13qLCt0Ki3A,track_genre,2,['acoustic' 'chill']
2,01MVOl9KtVTNfFiBU9I7dc,track_genre,2,['acoustic' 'indie-pop']
3,6Vc5wAMmXdKIAM7WUoEb7N,track_genre,2,['acoustic' 'piano']
4,1EzrEOXmMH3G43AXT1y7pA,track_genre,2,['acoustic' 'rock']
...,...,...,...,...
16294,79cxnmnGiC0qZfxi5ogp4j,track_genre,2,['techno' 'trance']
16295,1B0FEDRzzN5GP7HGZZfNQl,track_genre,2,['techno' 'trance']
16296,4D41idYLHmXYGaHZeRWtPT,track_genre,2,['techno' 'trip-hop']
16297,27nGU2v3syK7aU3AVY2vUO,track_genre,2,['techno' 'trance']
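
The row-by-row loop can be slow on tens of thousands of groups. As an alternative, a vectorized sketch of the same idea is shown below: count the distinct values per column within each `track_id` group and keep the columns with more than one. The `toy` DataFrame and the variable names are illustrative assumptions, and this is not the function used above:

```python
import pandas as pd

# Toy frame: "a" differs in track_genre, "c" differs in popularity (values are made up)
toy = pd.DataFrame({
    "track_id": ["a", "a", "b", "c", "c"],
    "track_name": ["A", "A", "B", "C", "C"],
    "track_genre": ["acoustic", "chill", "pop", "techno", "techno"],
    "popularity": [50, 50, 10, 70, 71],
})

# Distinct values per column within each track_id group
n_distinct = toy.groupby("track_id").nunique()

# True where a column takes more than one value for the same track_id
varies = n_distinct > 1

# Keep only groups with at least one varying column and list those columns
inconsistent_cols = (
    varies[varies.any(axis=1)]
    .apply(lambda row: ", ".join(row[row].index), axis=1)
)

print(inconsistent_cols)
```

Like `nunique()` in the loop version, this ignores missing values when counting distinct entries.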


In [137]:
def handle_duplicates_simple(df, id_col='track_id', strategy='first'):
    """
    Handle duplicates efficiently with a few basic strategies.

    Args:
        df: pandas DataFrame
        id_col: Column used to identify duplicates (default 'track_id')
        strategy: 'first' (keep first), 'last' (keep last),
                 'mean' (numeric mean / categorical mode),
                 'concat' (join unique values with '|')

    Returns:
        Processed DataFrame without duplicates
    """
    # Quick check for duplicates
    dup_mask = df.duplicated(subset=id_col, keep=False)

    if not dup_mask.any():
        return df.copy()

    # Simple strategies
    if strategy in ['first', 'last']:
        return df.drop_duplicates(subset=id_col, keep=strategy)

    # Strategies that require grouping
    grouped = df.groupby(id_col)

    if strategy == 'mean':
        return grouped.agg(lambda x: x.mean() if np.issubdtype(x.dtype, np.number)
                          else x.mode()[0] if not x.mode().empty else x.iloc[0]).reset_index()

    if strategy == 'concat':
        return grouped.agg(lambda x: '|'.join(map(str, x.unique()))).reset_index()

    raise ValueError(f"Invalid strategy: {strategy}. Use 'first', 'last', 'mean' or 'concat'")
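
Since the inspection above found `track_genre` to be the differing column, one possible variation is to join the genres of each `track_id` while keeping the first value of every other column. The following is only a sketch under that assumption, on a made-up frame; `toy`, `agg_map` and `merged` are illustrative names, and this is not necessarily the strategy the notebook goes on to apply:

```python
import pandas as pd

# Toy frame: track "a" is listed under two genres (values are made up)
toy = pd.DataFrame({
    "track_id": ["a", "a", "b"],
    "track_name": ["A", "A", "B"],
    "popularity": [50, 50, 10],
    "track_genre": ["acoustic", "chill", "pop"],
})

# Keep the first value for every column except the id and the genre
agg_map = {
    col: "first"
    for col in toy.columns
    if col not in ("track_id", "track_genre")
}

# Join the unique genres of each track with '|'
agg_map["track_genre"] = lambda s: "|".join(s.unique())

# sort=False keeps the groups in order of first appearance
merged = toy.groupby("track_id", sort=False).agg(agg_map).reset_index()

print(merged)
```

This keeps one row per track without discarding the genre information, at the cost of turning `track_genre` into a multi-valued string.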

### Checking categorical values in `track_genre`

In [138]:
np.sort(df.track_genre.unique())

array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'brazil',
       'breakbeat', 'british', 'cantopop', 'chicago-house', 'children',
       'chill', 'classical', 'club', 'comedy', 'country', 'dance',
       'dancehall', 'death-metal', 'deep-house', 'detroit-techno',
       'disco', 'disney', 'drum-and-bass', 'dub', 'dubstep', 'edm',
       'electro', 'electronic', 'emo', 'folk', 'forro', 'french', 'funk',
       'garage', 'german', 'gospel', 'goth', 'grindcore', 'groove',
       'grunge', 'guitar', 'happy', 'hard-rock', 'hardcore', 'hardstyle',
       'heavy-metal', 'hip-hop', 'honky-tonk', 'house', 'idm', 'indian',
       'indie', 'indie-pop', 'industrial', 'iranian', 'j-dance', 'j-idol',
       'j-pop', 'j-rock', 'jazz', 'k-pop', 'kids', 'latin', 'latino',
       'malay', 'mandopop', 'metal', 'metalcore', 'minimal-techno', 'mpb',
       'new-age', 'opera', 'pagode', 'party', 'piano', 'pop', 'pop-film',
       'pow