Key points from preprocess.ipynb:

1. **Dataset Loading**: The dataset `Music Info.csv` is loaded into a pandas DataFrame (`df_music`).

2. **Exploratory Data Analysis (EDA)**:
   - Displayed the first few rows and dataset structure.
   - Calculated summary statistics for numeric columns.
   - Counted unique values in `genre` and `tags`.
   - Checked for missing values in all columns.

3. **Handling Missing Values**:
   - Replaced missing values in `genre` with `"unknown"`.
   - Replaced missing values in `tags` with `"no_tags"`.

4. **Removing Duplicates**: Removed duplicate rows based on the `track_id` column.

5. **Feature Selection**:
   - Selected numeric features: `danceability`, `energy`, `loudness`, `speechiness`, `acousticness`, `instrumentalness`, `liveness`, `valence`, `tempo`, `duration_ms`.
   - Selected categorical features: `genre`, `tags`, `key`, `mode`, and `year`.

6. **Scaling**: Scaled numeric columns (`tempo`, `loudness`, `duration_ms`, `year`) using `MinMaxScaler`.

7. **One-Hot Encoding**: Applied one-hot encoding to the `genre` column.

8. **Processing Tags**: Converted the `tags` column (comma-separated strings) into lists of tags.

9. **Saving Processed Data**: Saved the processed DataFrame to `Processed_Music_Info.csv`.

10. **Final Checks**:
    - Verified the shape (50,683 rows, 30 columns) and data types of the processed dataset.
    - Confirmed that the scaled columns are `float64` and the one-hot encoded `genre` columns are `bool`.

In [1]:
import pandas as pd

df_music = pd.read_csv('data/Music Info.csv')

print(df_music.head())
print(df_music.info())

             track_id             name           artist  \
0  TRIOREW128F424EAF0   Mr. Brightside      The Killers   
1  TRRIVDJ128F429B0E8       Wonderwall            Oasis   
2  TROUVHL128F426C441  Come as You Are          Nirvana   
3  TRUEIND128F93038C4      Take Me Out  Franz Ferdinand   
4  TRLNZBD128F935E4D8            Creep        Radiohead   

                                 spotify_preview_url              spotify_id  \
0  https://p.scdn.co/mp3-preview/4d26180e6961fd46...  09ZQ5TmUG8TSL56n0knqrj   
1  https://p.scdn.co/mp3-preview/d012e536916c927b...  06UfBBDISthj1ZJAtX4xjj   
2  https://p.scdn.co/mp3-preview/a1c11bb1cb231031...  0keNu0t0tqsWtExGM3nT1D   
3  https://p.scdn.co/mp3-preview/399c401370438be4...  0ancVQ9wEcHVd0RrGICTE4   
4  https://p.scdn.co/mp3-preview/e7eb60e9466bc3a2...  01QoK9DA7VTeTSE3MNzp4I   

                                                tags genre  year  duration_ms  \
0  rock, alternative, indie, alternative_rock, in...   NaN  2004       222200   
1 

In [2]:
# Summary statistics
print(df_music.describe(include=[float, int]))

# Unique values in categorical columns
print("Unique genres:", df_music["genre"].nunique(dropna=True))
print("Unique tags:", df_music["tags"].nunique(dropna=True))

               year   duration_ms  danceability        energy           key  \
count  50683.000000  5.068300e+04  50683.000000  50683.000000  50683.000000   
mean    2004.017323  2.511551e+05      0.493537      0.686486      5.312748   
std        8.860172  1.075860e+05      0.178838      0.251808      3.568078   
min     1900.000000  1.439000e+03      0.000000      0.000000      0.000000   
25%     2001.000000  1.927330e+05      0.364000      0.514000      2.000000   
50%     2006.000000  2.349330e+05      0.497000      0.744000      5.000000   
75%     2009.000000  2.881930e+05      0.621000      0.905000      9.000000   
max     2022.000000  3.816373e+06      0.986000      1.000000     11.000000   

           loudness          mode   speechiness  acousticness  \
count  50683.000000  50683.000000  50683.000000  50683.000000   
mean      -8.291204      0.631060      0.076023      0.213808   
std        4.548365      0.482522      0.076007      0.302848   
min      -60.000000      0.0

In [3]:
missing_counts = df_music.isnull().sum()
print("Missing values per column:\n", missing_counts)

Missing values per column:
 track_id                   0
name                       0
artist                     0
spotify_preview_url        0
spotify_id                 0
tags                    1127
genre                  28335
year                       0
duration_ms                0
danceability               0
energy                     0
key                        0
loudness                   0
mode                       0
speechiness                0
acousticness               0
instrumentalness           0
liveness                   0
valence                    0
tempo                      0
time_signature             0
dtype: int64
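
To put raw counts like these in proportion, the share of missing entries per column can be computed as well. This is a minimal sketch on a made-up toy frame (the column names mirror the notebook, the values are hypothetical), not a cell from the original notebook.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "track_id": ["a", "b", "c", "d"],
    "tags": ["rock", None, "pop, indie", "jazz"],
    "genre": [None, None, "Pop", None],
})

# isnull() yields a boolean frame; its column-wise mean is the
# fraction of missing entries in each column.
missing_share = toy.isnull().mean()
print(missing_share)
```

Applied to `df_music`, the same call would show that `genre` is missing in more than half of the rows (28,335 of 50,683).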


In [4]:
# Assign the result back instead of calling fillna(inplace=True) on a
# selected column, which operates on a copy and is deprecated in pandas.
df_music["genre"] = df_music["genre"].fillna("unknown")
df_music["tags"] = df_music["tags"].fillna("no_tags")
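
A minimal sketch of an equivalent fill that passes a column-to-value mapping to `DataFrame.fillna`, so no in-place method is ever called on a selected column. The toy frame and its values are made up for illustration; only the placeholder strings come from the notebook.

```python
import pandas as pd

# Hypothetical toy frame with gaps in both columns.
toy = pd.DataFrame({
    "genre": ["Rock", None, "Pop"],
    "tags": ["rock, indie", "pop", None],
})

# One placeholder per column, matching the notebook's choices.
fill_values = {"genre": "unknown", "tags": "no_tags"}

# fillna with a dict fills each listed column with its own value
# and returns a new DataFrame.
toy_filled = toy.fillna(fill_values)
print(toy_filled)
```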


In [5]:
# Remove duplicate rows based on 'track_id', keeping the first occurrence
duplicates = df_music.duplicated(subset=["track_id"])
df_music = df_music[~duplicates]

This ensures that each `track_id` appears only once in the dataset.
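
The same de-duplication can be sketched with `drop_duplicates`, followed by an explicit uniqueness check. `DataFrame.duplicated` defaults to `keep="first"`, so both forms keep the first row seen for each ID. The toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame in which track "T1" appears twice.
toy = pd.DataFrame({
    "track_id": ["T1", "T2", "T1", "T3"],
    "name": ["first", "second", "first again", "third"],
})

# Keep the first row for each track_id and drop later repeats.
deduped = toy.drop_duplicates(subset=["track_id"], keep="first")

# Sanity check: every remaining track_id is unique.
assert deduped["track_id"].is_unique
print(len(toy), "->", len(deduped))
```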

Only `tags` and `genre` had missing values (1,127 and 28,335 rows respectively); both were filled with placeholders above.

## Selecting Relevant Columns

Numeric features (useful ones): danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms
Categorical features: genre, tags, key, mode, and (maybe) year, which is treated as numeric and scaled below

In [14]:
model_columns = [
    "danceability", "energy", "loudness",
    "speechiness", "acousticness", "instrumentalness",
    "liveness", "valence", "tempo",
    "duration_ms", "genre", "tags", "key", "mode", "year"
]

df_model = df_music[model_columns].copy()

## Scaling

In [15]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
numeric_cols = ["tempo", "loudness", "duration_ms", "year"]

df_model[numeric_cols] = scaler.fit_transform(df_model[numeric_cols])
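
With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each column to `(x - min) / (max - min)`. The sketch below checks that on a made-up toy frame; the values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values for two of the scaled columns.
toy = pd.DataFrame({
    "tempo": [60.0, 120.0, 180.0],
    "loudness": [-60.0, -10.0, 0.0],
})

# Scale with scikit-learn (returns a NumPy array).
scaled = MinMaxScaler().fit_transform(toy)

# Reproduce the result by hand, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

assert np.allclose(scaled, manual.to_numpy())
print(manual)
```

Because the minimum and maximum come from the data itself, a single extreme value (such as the 3,816,373 ms maximum of `duration_ms`) compresses the rest of the column toward 0.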

## Since the dataset has only 15 unique genres (plus the `unknown` placeholder), one-hot encoding is practical

In [16]:
df_model = pd.get_dummies(df_model, columns=["genre"], prefix="genre")
print(df_model.head())

   danceability  energy  loudness  speechiness  acousticness  \
0         0.355   0.918  0.874265       0.0746      0.001190   
1         0.409   0.892  0.874061       0.0336      0.000807   
2         0.508   0.826  0.851906       0.0400      0.000175   
3         0.279   0.664  0.803699       0.0371      0.000389   
4         0.515   0.430  0.786666       0.0369      0.010200   

   instrumentalness  liveness  valence     tempo  duration_ms  ...  \
0          0.000000    0.0971    0.240  0.619996     0.057868  ...   
1          0.000000    0.2070    0.651  0.730137     0.067412  ...   
2          0.000459    0.0878    0.543  0.502363     0.057008  ...   
3          0.000655    0.1330    0.490  0.437682     0.061754  ...   
4          0.000141    0.1290    0.104  0.384441     0.062177  ...   

  genre_Metal  genre_New Age  genre_Pop  genre_Punk  genre_Rap  genre_Reggae  \
0       False          False      False       False      False         False   
1       False          False      
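
A small sketch of what `pd.get_dummies` does to a `genre` column, using a hypothetical three-row frame: the original column is replaced by one indicator column per distinct value, and each row has exactly one indicator set.

```python
import pandas as pd

# Hypothetical toy frame; "unknown" stands in for a filled missing genre.
toy = pd.DataFrame({
    "energy": [0.9, 0.4, 0.7],
    "genre": ["Rock", "unknown", "Pop"],
})

# Replace "genre" with one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["genre"], prefix="genre")
print(encoded.columns.tolist())

# Exactly one genre indicator is set in every row.
genre_cols = [c for c in encoded.columns if c.startswith("genre_")]
assert (encoded[genre_cols].sum(axis=1) == 1).all()
```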

## Since the tags are comma-separated strings, we convert them to lists of tags

In [17]:
# Split on ", "; rows holding the "no_tags" placeholder become empty lists
df_model["tags"] = df_model["tags"].apply(lambda x: x.split(", ") if x != "no_tags" else [])

In [19]:
print(df_model["tags"].head())

0    [rock, alternative, indie, alternative_rock, i...
1    [rock, alternative, indie, pop, alternative_ro...
2    [rock, alternative, alternative_rock, 90s, gru...
3    [rock, alternative, indie, alternative_rock, i...
4    [rock, alternative, indie, alternative_rock, i...
Name: tags, dtype: object


The tags are held as lists here. When the DataFrame is written to CSV, each list is serialized as a string, and the tags will be handled later in the model using a text-embedding approach.
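
A minimal sketch of what happens to a list-valued `tags` column on a CSV round trip, and one possible way to recover the lists with `ast.literal_eval`. The toy data and the in-memory buffer are assumptions for illustration; the notebook itself writes to a file on disk.

```python
import ast
import io

import pandas as pd

# Hypothetical toy column using the notebook's "no_tags" placeholder.
toy = pd.DataFrame({"tags": ["rock, indie", "no_tags", "pop"]})
toy["tags"] = toy["tags"].apply(
    lambda s: s.split(", ") if s != "no_tags" else []
)

# Write to an in-memory CSV and read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, each cell is the string form of a list, not a list.
assert isinstance(reloaded["tags"].iloc[0], str)

# literal_eval safely parses the string back into a Python list.
reloaded["tags"] = reloaded["tags"].apply(ast.literal_eval)
print(reloaded["tags"].tolist())
```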

In [22]:
print(df_model.shape) 

# Check dtypes: every column except tags should be numeric or boolean
print(df_model.dtypes)

(50683, 30)
danceability        float64
energy              float64
loudness            float64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_ms         float64
tags                 object
key                   int64
mode                  int64
year                float64
genre_Blues            bool
genre_Country          bool
genre_Electronic       bool
genre_Folk             bool
genre_Jazz             bool
genre_Latin            bool
genre_Metal            bool
genre_New Age          bool
genre_Pop              bool
genre_Punk             bool
genre_Rap              bool
genre_Reggae           bool
genre_RnB              bool
genre_Rock             bool
genre_World            bool
genre_unknown          bool
dtype: object


In [23]:
df_model.to_csv("data_processed/Processed_Music_Info.csv", index=False)